Closed lydiayliu closed 2 years ago
Where do you think we should put this filter? Maybe as part of the splitDatabase
?
Actually, that might be a pretty good idea!
It might worth to also implement an individual subcommand, so it would be useful when people only run VEP data, and don't want to split at all. So this gives users a little more flexibility. Any thoughts?
i dont really want a subcommand with redundant functionality? It think either put it in splitDatabase or have a subcommand
If in splitDatabase and people only run VEP, there's nothing to split right and it will just be the filter functionality?
If as a subcommand, could be annoying to have to link all of the split fastas, but at the same time can filter on only select fastas. For example I might only want to filter for noncoding transcript abundance, and want to include all coding transcript variants regardless of coding transcript abundance
I see your point. If this is a subcommand, we can run it on the full FASTA, and then run splitDatabase on the filtered FASTA? That might make it more clean. So split only does spliting, and filter only does filtering
that's a good point haha. so subcommand it is?
OK, subcommand it is! In terms of the actual filtering, we will take the quantification output, for example RSEM output, and filter out genes with the expression level smaller than the cut out. You also mentioned above that we can filter by biotype? Should we do something like filter_vep
to use some kind of query string?
Would a query string be too complicated?
So far
What if, as you mentioned, we want all the noncoding, but coding only above certain cutoff? Using individual args, this would be something like:
moPepGen filterPeptides \
--noncanonical-peptide xxx_moPepGen.fasta \
--rnaseq-quant rsem.txt \
--quant-type FPKM \
--quant-cutoff 5 \
--biotype-filter psudogene,lncRNA,... \
--biotype-keep protein_coding \
--biotype-drop xxx
With query string, it would be like this:
moPepGen filterPeptides \
--noncanonical-peptide xxx_moPepGen.fasta \
--rnaseq-quant rsem.txt \
--query "(biotype = protein_coding) | (FPKM > 5 & ! biotype in type1,type2,type3 )"
But I agree that it is too complicated to implement. I probably don't want to do it right now..
hahaha i agree it looks great. but i have no idea how to implement that cleanly!
I think it wouldnt be too much to ask the user to provide a list of transcripts that they care to filter and a clean column of numbers to filter based on :P
Does this look like a good user interface? We can make the --biotype-keep
and --biotype--drop
optional. Seems like wo don't need the --biotype-filter
? If it's not --biotype-keep
or --biotype-drop
, it must be something want to be filtered for gene expression?
moPepGen filterPeptides \
--noncanonical-peptide xxx_moPepGen.fasta \
--rnaseq-quant rsem.txt \
--quant-type FPKM \
--quant-cutoff 5 \
--biotype-filter psudogene,lncRNA,... \
--biotype-keep protein_coding \
--biotype-drop xxx
the biotype stuff reminds me of that biotype file that we have for determining what transcript biotypes count as noncoding. can we just use something like that? i think the biotypes are too numerous for listing on command line?
Maybe we can have --keep-all-coding
, --keep-all-noncoding
and '--drop-all-coding', '--drop-all-noncoding'? They are all optional and default to False, so all genes will be filtered according to gene expression?
sorry what is the difference between 'keep' and 'drop'?
keep
is only filter these and drop
is filter everything except these? Or the otherway around?
I think we can just have a --filter_coding
--filtering_noncoding
that are both set to TRUE? If one is set to false, don't filter in that group and keep all peptides. If both are set to false, do nothing
Oh OK, that seems much easier. If some one would like to filter out all noncoding variant peptides (that's what drop means), they can just do that by using splieDatabase
For fusion, should both transcripts above the given cutoff or at least one?
haha great question. Personally I would skip fusions and circRNAs by default, cuz they have internal filters when parsed
Should we have --filter-fusion
and --filter-circ-rna
by default as False? Should we even make it possible for users to filter those?
I don't think these should even be options :P Becuase the filters do exist XD
Ok, sounds good to me!
This is definitely a bucket list item for me. It would be fantanstic if we can add a filter to remove from the noncanonical transcript databases transcripts that are not found in a sample (for example from the result of an aligner or RNA quantification algorithm). Should probably be a separate command line tool. Of course can be extended to protein coding transcripts as well!