filter fasta by transcript abundance

lydiayliu commented 2 years ago

This is definitely a bucket list item for me. It would be fantanstic if we can add a filter to remove from the noncanonical transcript databases transcripts that are not found in a sample (for example from the result of an aligner or RNA quantification algorithm). Should probably be a separate command line tool. Of course can be extended to protein coding transcripts as well!

zhuchcn commented 2 years ago

Where do you think we should put this filter? Maybe as part of the splitDatabase?

lydiayliu commented 2 years ago

Actually, that might be a pretty good idea!

zhuchcn commented 2 years ago

It might worth to also implement an individual subcommand, so it would be useful when people only run VEP data, and don't want to split at all. So this gives users a little more flexibility. Any thoughts?

lydiayliu commented 2 years ago

i dont really want a subcommand with redundant functionality? It think either put it in splitDatabase or have a subcommand

If in splitDatabase and people only run VEP, there's nothing to split right and it will just be the filter functionality?

If as a subcommand, could be annoying to have to link all of the split fastas, but at the same time can filter on only select fastas. For example I might only want to filter for noncoding transcript abundance, and want to include all coding transcript variants regardless of coding transcript abundance

zhuchcn commented 2 years ago

I see your point. If this is a subcommand, we can run it on the full FASTA, and then run splitDatabase on the filtered FASTA? That might make it more clean. So split only does spliting, and filter only does filtering

lydiayliu commented 2 years ago

that's a good point haha. so subcommand it is?

zhuchcn commented 2 years ago

OK, subcommand it is! In terms of the actual filtering, we will take the quantification output, for example RSEM output, and filter out genes with the expression level smaller than the cut out. You also mentioned above that we can filter by biotype? Should we do something like filter_vep to use some kind of query string?

lydiayliu commented 2 years ago

Would a query string be too complicated?

So far

input trasncript abundance
format of transcript abundance, or names of columns where transcript_id and matching "abundance" is given
level of cutoff
inclusion / exclusion biotype files?

zhuchcn commented 2 years ago

What if, as you mentioned, we want all the noncoding, but coding only above certain cutoff? Using individual args, this would be something like:

moPepGen filterPeptides \
    --noncanonical-peptide xxx_moPepGen.fasta \
    --rnaseq-quant rsem.txt \
    --quant-type FPKM \
    --quant-cutoff 5 \
    --biotype-filter psudogene,lncRNA,... \
    --biotype-keep protein_coding \
    --biotype-drop xxx

With query string, it would be like this:

moPepGen filterPeptides \
    --noncanonical-peptide xxx_moPepGen.fasta \
    --rnaseq-quant rsem.txt \
    --query "(biotype = protein_coding) | (FPKM > 5  & ! biotype in type1,type2,type3 )"

But I agree that it is too complicated to implement. I probably don't want to do it right now..

lydiayliu commented 2 years ago

hahaha i agree it looks great. but i have no idea how to implement that cleanly!

I think it wouldnt be too much to ask the user to provide a list of transcripts that they care to filter and a clean column of numbers to filter based on :P

zhuchcn commented 2 years ago

Does this look like a good user interface? We can make the --biotype-keep and --biotype--drop optional. Seems like wo don't need the --biotype-filter? If it's not --biotype-keep or --biotype-drop, it must be something want to be filtered for gene expression?

moPepGen filterPeptides \
    --noncanonical-peptide xxx_moPepGen.fasta \
    --rnaseq-quant rsem.txt \
    --quant-type FPKM \
    --quant-cutoff 5 \
    --biotype-filter psudogene,lncRNA,... \
    --biotype-keep protein_coding \
    --biotype-drop xxx

lydiayliu commented 2 years ago

the biotype stuff reminds me of that biotype file that we have for determining what transcript biotypes count as noncoding. can we just use something like that? i think the biotypes are too numerous for listing on command line?

zhuchcn commented 2 years ago

Maybe we can have --keep-all-coding, --keep-all-noncoding and '--drop-all-coding', '--drop-all-noncoding'? They are all optional and default to False, so all genes will be filtered according to gene expression?

lydiayliu commented 2 years ago

sorry what is the difference between 'keep' and 'drop'?

keep is only filter these and drop is filter everything except these? Or the otherway around?

I think we can just have a --filter_coding --filtering_noncoding that are both set to TRUE? If one is set to false, don't filter in that group and keep all peptides. If both are set to false, do nothing

zhuchcn commented 2 years ago

Oh OK, that seems much easier. If some one would like to filter out all noncoding variant peptides (that's what drop means), they can just do that by using splieDatabase

zhuchcn commented 2 years ago

For fusion, should both transcripts above the given cutoff or at least one?

lydiayliu commented 2 years ago

haha great question. Personally I would skip fusions and circRNAs by default, cuz they have internal filters when parsed

zhuchcn commented 2 years ago

Should we have --filter-fusion and --filter-circ-rna by default as False? Should we even make it possible for users to filter those?

lydiayliu commented 2 years ago

I don't think these should even be options :P Becuase the filters do exist XD

zhuchcn commented 2 years ago

Ok, sounds good to me!

uclahs-cds / package-moPepGen

filter fasta by transcript abundance #133