Closed dpryan79 closed 6 years ago
In fact, creating the "saf" file is a terrible idea: you end up quantifying intronic reads from pre-mRNAs and other random things (unannotated miRNAs and expressed repeats). There's a time when such things are useful, but that should simply never be the default.
The generated SAF files have transcript entries rather than exon entries.
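For illustration, here is a minimal sketch of what an exon-level SAF would look like if built straight from the GTF. It assumes the standard SAF columns (GeneID, Chr, Start, End, Strand) and a GTF whose attribute column carries `gene_id "..."`. The function names and the toy `geneA` records are made up for the example and are not taken from the workflow.

```python
import csv
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')

def gtf_exons_to_saf(gtf_lines):
    """Yield unique SAF rows (GeneID, Chr, Start, End, Strand) for exon features only."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # skip gene/transcript/CDS rows: these would pull in introns
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        # GTF and SAF are both 1-based and end-inclusive, so coordinates copy over as-is.
        row = (match.group(1), fields[0], fields[3], fields[4], fields[6])
        if row not in seen:  # exons shared between transcripts appear once
            seen.add(row)
            yield row

def write_saf(rows, handle):
    """Write rows as tab-separated SAF with the usual header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["GeneID", "Chr", "Start", "End", "Strand"])
    writer.writerows(rows)

# Toy GTF records (hypothetical gene, not from any real annotation).
EXAMPLE_GTF = [
    'chr1\ttoy\tgene\t100\t500\t.\t+\t.\tgene_id "geneA";\n',
    'chr1\ttoy\ttranscript\t100\t500\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";\n',
    'chr1\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";\n',
    'chr1\ttoy\texon\t400\t500\t.\t+\t.\tgene_id "geneA"; transcript_id "txA";\n',
]

SAF_ROWS = list(gtf_exons_to_saf(EXAMPLE_GTF))
```

With transcript entries instead, the single row for `geneA` would span 100-500 and so include the 201-399 intron, which is exactly the problem described above.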
I'll have a PR for this shortly, I expect. The BED files generated from this are used elsewhere. I wonder why they're made, though, since we already have transcript BED files (genes_bed).
Hmm, that's weird, I haven't done this SAF stuff... Sure, exon counting is wanted.
I'm accumulating some changes in my WIP PR.
Anyway, I think the current code goes back to the pre-snakemake days...though I have to wonder why it was done back then too.
One reason to have extra BED files was the single-cell RNA workflow, where I wanted to filter out pseudogenes based on biotype... Yeah, the SAF part might come from the Python workflow we once had.
Maybe it was done previously by Fabian, back when featureCounts used SAF as its default format? (Although it always accepted GTF as an option, so I can't think of a clear reason to use SAF anyway.)
The rule "annotation_bed2fasta" in filter_annotation.snakefile also seems a bit wrong: it is missing the "-split" and "-s" options to "bedtools getfasta". Or am I totally confused now?
This rule is used to create the input fasta file for salmon!
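To make the concern concrete, here is a toy Python sketch of the difference those two options make on a BED12 record: block-wise (spliced) extraction versus taking the whole chromStart-chromEnd span, plus reverse-complementing minus-strand features. This is not bedtools code; `extract_bed12`, its parameters, and the toy sequence are invented for illustration.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_bed12(chrom_seq, chrom_start, strand, block_sizes, block_starts,
                  split=True, stranded=True):
    """Extract the sequence of one BED12 record from a chromosome string.

    split=True concatenates only the blocks (exons);
    split=False returns the whole span, introns included.
    stranded=True reverse-complements features on the minus strand.
    """
    if split:
        seq = "".join(
            chrom_seq[chrom_start + start:chrom_start + start + size]
            for start, size in zip(block_starts, block_sizes)
        )
    else:
        # BED12: chromEnd equals chromStart + last blockStart + last blockSize.
        chrom_end = chrom_start + block_starts[-1] + block_sizes[-1]
        seq = chrom_seq[chrom_start:chrom_end]
    if stranded and strand == "-":
        seq = reverse_complement(seq)
    return seq

# Toy chromosome: exons in upper case, intron and flanks in lower case.
TOY_CHROM = "ttACGTggggTTAAcc"
BLOCK_SIZES = [4, 4]
BLOCK_STARTS = [0, 8]

SPLICED = extract_bed12(TOY_CHROM, 2, "+", BLOCK_SIZES, BLOCK_STARTS)
UNSPLICED = extract_bed12(TOY_CHROM, 2, "+", BLOCK_SIZES, BLOCK_STARTS, split=False)
SPLICED_MINUS = extract_bed12(TOY_CHROM, 2, "-", BLOCK_SIZES, BLOCK_STARTS)
```

Here `SPLICED` is `ACGTTTAA`, while `UNSPLICED` is `ACGTggggTTAA` and keeps the intron. For a transcriptome FASTA fed to salmon, the spliced, strand-aware form is presumably what is wanted.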
Coming back to the original topic: by using "Annotation/genes.filtered.gtf" as input to featureCounts with the default --filter_annotation option (which is empty), the workflow used the full GTF file (e.g. Gencode m15) as given in the organism config file...
I added the filtered GTF as input to featureCounts in this commit.
Is there a reason that GTF files are munged beyond recognition in the RNA-seq workflow? They're fine as is for bulk RNA-seq and we can't even report that "we used Gencode m15" in the methods if we modify the crap out of them.