Are split files filtered?

lydiayliu commented 1 year ago

I'm trying to do as many things as possible in a single meta pipeline run. Here is the config specific to callNoncanonical.

You can see that I want to do three things:

parse raw files
filter fasta
split fasta


        mopepgen_version = '0.11.3'
        docker_image_moPepGen = "ghcr.io/uclahs-cds/mopepgen:0.11.3"

        entrypoint = 'parser'
        filter_fasta = true
        split_fasta = true
        encode_fasta = false
        decoy_fasta = false

        index_dir = '/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/ref/GRCh38-EBI-GENCODE34/index/'

        noncoding_peptides = '/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/ref/GRCh38-EBI-GENCODE34/noncoding/min_default.fa'

        merge_variant_noncoding = 'both'

        parseSTARFusion {

            min_est_j = 0 // field missing in CCLE

        }

        callVariant {

            max_variants_per_node = 7  // default 7
            cleavage_rule = 'trypsin'  // default trypsin
            miscleavage = 2 // default 2
            min_mw = 500.0 // default 500.0
            min_length = 7 // default 7
            max_length = 25 // default 25

        }

        filterFasta {

            variant_peptides {
                skip_lines = 1
                tx_id_col = 1
                quant_col = 2
                quant_cutoff = 0.01
                keep_all_coding = false
            }
            noncoding_peptides {
                skip_lines = 1
                tx_id_col = 1
                quant_col = 2
                quant_cutoff = 0.01
                keep_all_coding = false
            }
            merged_peptides {
                skip_lines = 1
                tx_id_col = 1
                quant_col = 2
                quant_cutoff = 0.01
                keep_all_coding = false
            }

        }

        splitFasta {

            order_source = 'Mutation,Fusion,Coding,Noncoding'
            group_source = 'Coding:Mutation,Fusion'
            max_source_groups = 1
            additional_split = 'Noncoding'

        }

        summarizeFasta {

            order_source = 'Mutation,Fusion,Coding,Noncoding,Mutation-Fusion,Mutation-Noncoding,Fusion-Noncoding,Mutation-Fusion-Noncoding'
            ignore_missing_source = true

        }
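
As a rough illustration of what the filterFasta parameters above are meant to express, here is a minimal Python sketch: skip `skip_lines` header lines of the expression table, read the 1-based columns `tx_id_col` and `quant_col`, and keep transcripts whose value reaches `quant_cutoff`. The function name, the tab-delimited input, the `>=` comparison, and the transcript IDs are all assumptions made for illustration; this is not moPepGen's implementation.

```python
import csv
import io


def transcripts_passing_cutoff(table_text, skip_lines=1, tx_id_col=1,
                               quant_col=2, quant_cutoff=0.01):
    """Return the set of transcript IDs whose quantity meets the cutoff.

    Column numbers are 1-based, mirroring the config values above.
    Assumes a tab-delimited table; the >= comparison is an assumption.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    kept = set()
    for line_no, row in enumerate(reader):
        if line_no < skip_lines or not row:
            continue  # header line(s) or blank line
        tx_id = row[tx_id_col - 1]
        quant = float(row[quant_col - 1])
        if quant >= quant_cutoff:
            kept.add(tx_id)
    return kept


# Hypothetical expression table: one header line, then ID and quantity.
example = "tx_id\ttpm\nENST0001\t0.5\nENST0002\t0.001\nENST0003\t0.01\n"
print(sorted(transcripts_passing_cutoff(example)))
```

With `keep_all_coding = false`, as in the config, coding transcripts would get no special treatment in a filter like this.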

Here's the output for an example sample. The GVFs look fine, and the variant and merged fasta files look fine, along with their summary files:

yiyangliu@ip-0A125227:/hot/project/method/AlgorithmDevelopment/ALGO-000074-moPepGen/CCLE/processed/noncanonical-database/call-nonCanonicalPeptide/2023-01-28/pipeline-meta-call-NonCanonicalPeptide-0.0.1/ACH-000738/call-NonCanonicalPeptide-1.0.0/ACH-000738/moPepGen-0.11.3/output$ ls *
ACH-000738_Fusion_STARFusion.gvf                 ACH-000738_Mutation_VEP.gvf                   ACH-000738_variant_peptides_filtered_summary.txt
ACH-000738_merged_peptides.fasta                 ACH-000738_noncoding_peptides_filtered.fasta  ACH-000738_variant_peptides_summary.txt
ACH-000738_merged_peptides_filtered.fasta        ACH-000738_variant_peptides.fasta
ACH-000738_merged_peptides_filtered_summary.txt  ACH-000738_variant_peptides_filtered.fasta

But in the split directory, I only got:

split:
ACH-000738_Coding.fasta  ACH-000738_Noncoding-additional.fasta  ACH-000738_Noncoding.fasta

Are these split files filtered or unfiltered? Also, should there be a merged_peptides_summary.txt?
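
One way to check this from the outputs alone is to compare peptide sequences: if the split files were produced from the filtered FASTAs, every peptide in them should also appear in the `*_filtered.fasta` files. The Python sketch below does that comparison. The function names are hypothetical, and it assumes splitting leaves peptide sequences unchanged; it is a sanity check, not part of the pipeline.

```python
def read_fasta_sequences(lines):
    """Collect the set of peptide sequences from FASTA-formatted lines."""
    sequences, current = set(), []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current:
                sequences.add("".join(current))
            current = []
        elif line:
            current.append(line)
    if current:
        sequences.add("".join(current))
    return sequences


def split_looks_filtered(split_files, filtered_files):
    """True if every peptide in the split files also occurs in the filtered files."""
    split_seqs = set().union(*(read_fasta_sequences(f) for f in split_files))
    kept_seqs = set().union(*(read_fasta_sequences(f) for f in filtered_files))
    return split_seqs <= kept_seqs


# Tiny in-memory demo with made-up peptides (headers are hypothetical).
filtered = [">pep1", "PEPTIDEK", ">pep2", "ANOTHERK"]
split_ok = [">pep1", "PEPTIDEK"]
split_extra = [">pep1", "PEPTIDEK", ">pep3", "EXTRAPEPK"]
print(split_looks_filtered([split_ok], [filtered]))     # True
print(split_looks_filtered([split_extra], [filtered]))  # False
```

On real output, each argument would be a list of open file handles, for example the three files under `split/` against `ACH-000738_variant_peptides_filtered.fasta` and `ACH-000738_noncoding_peptides_filtered.fasta`. A `True` result is only suggestive: if filtering removed nothing, filtered and unfiltered split files would be indistinguishable.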

Also I think in many ways this is related to #64

zhuchcn commented 1 year ago

The split fasta files are filtered as long as you have the filterFasta namespace in the config file.

lydiayliu commented 1 year ago

But why don't we name the filtered split fastas differently from the unfiltered ones? This was my point in #64 as well: samples that don't have an expression table still output unfiltered split fastas, and those have the same names as the filtered split fastas from samples that do have an expression table.

Should there be a merged_peptides_summary.txt?

zhuchcn commented 1 year ago

> But why don't we name the filtered split fastas differently from the unfiltered ones? This was my point in https://github.com/uclahs-cds/pipeline-call-NonCanonicalPeptide/issues/64 as well: samples that don't have an expression table still output unfiltered split fastas, and those have the same names as the filtered split fastas from samples that do have an expression table.

That makes sense. Sorry I missed #64! I can add a label of filtered/split/splitFiltered/merged/mergedFiltered. Will there be cases where the variant peptides are filtered but the noncoding peptides are not?
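
A minimal sketch of how such a label could be derived from the steps that were run, written in Python for illustration only. The function name and its boolean flags are hypothetical, and the thread does not say what should happen when split and merged are combined, so this sketch simply rejects that case.

```python
def output_label(split=False, merged=False, filtered=False):
    """Build one of the proposed labels from the steps that were run.

    Returns '' when no step applies. Combining split and merged is not
    covered by the proposal, so it is rejected here (an assumption).
    """
    if split and merged:
        raise ValueError("split and merged together are not covered by the proposed labels")
    base = "split" if split else "merged" if merged else ""
    if not filtered:
        return base
    return base + "Filtered" if base else "filtered"


for flags in (
    {"filtered": True},
    {"split": True},
    {"split": True, "filtered": True},
    {"merged": True},
    {"merged": True, "filtered": True},
):
    print(flags, "->", output_label(**flags))
```

Deriving the label from what was actually run, rather than from what the config requested, would keep samples without an expression table from sharing a name with filtered outputs.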

> merged_peptides_summary.txt

Sounds good!

lydiayliu commented 1 year ago

I think something like a split_filtered directory would be great!

If the user didn't input noncoding peptides then there won't be any filtered ones.

It would be great if you could add this, and I'll test it out on CCLE again! I think this is the most complicated way I've tried to run this pipeline so far. I used to do the splitting and the splitting + filtering in separate runs.
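
To make the split_filtered directory idea concrete, here is a small Python sketch of the naming rule being discussed: the split output goes to `split_filtered` only when filtering is enabled and the sample actually has an expression table, and to `split` otherwise. The function and argument names are hypothetical and not taken from the pipeline.

```python
def split_output_dir(filter_fasta_enabled, has_expression_table):
    """Pick the split output directory name for one sample.

    Filtering only happens when it is enabled in the config AND the
    sample has an expression table (the situation described in #64).
    """
    if filter_fasta_enabled and has_expression_table:
        return "split_filtered"
    return "split"


# Two samples run with the same config, only one has an expression table.
print(split_output_dir(True, True))   # split_filtered
print(split_output_dir(True, False))  # split
```

With a rule like this, the directory name alone would tell you whether a sample's split fastas went through filtering.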

zhuchcn commented 1 year ago

I like that idea. It makes it a little easier.

uclahs-cds / pipeline-call-NonCanonicalPeptide

Are split files filtered? #80