maxplanck-ie / snakepipes

Customizable workflows based on snakemake and python for the analysis of NGS data
http://snakepipes.readthedocs.io

Why are we modifying GTF files by default? #90

Closed dpryan79 closed 6 years ago

dpryan79 commented 6 years ago

Is there a reason that GTF files are munged beyond recognition in the RNA-seq workflow? They're fine as is for bulk RNA-seq and we can't even report that "we used Gencode m15" in the methods if we modify the crap out of them.

dpryan79 commented 6 years ago

In fact, the creation of the "saf" file is a terrible idea: you end up quantifying intronic reads from pre-mRNAs and other random things (unannotated miRNAs and expressed repeats). There are times when such things are useful, but that should simply never be the default.

dpryan79 commented 6 years ago

The generated SAF files have transcript entries rather than exon entries.
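To illustrate the difference, here is a minimal sketch of building SAF rows from exon records only, so that intronic reads are not counted. It assumes a standard 9-column GTF with a `gene_id` attribute; the function name `gtf_exons_to_saf` and the example records are hypothetical and are not taken from the snakePipes code.

```python
import re

def gtf_exons_to_saf(gtf_lines):
    """Yield SAF rows (GeneID, Chr, Start, End, Strand) for exon records only.

    Hypothetical helper for illustration; GTF and SAF are both 1-based, inclusive.
    """
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # transcript/gene records would span introns, so drop them
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None:
            continue  # no gene_id attribute, nothing to assign the exon to
        yield (match.group(1), fields[0], int(fields[3]), int(fields[4]), fields[6])

# Made-up records: one transcript spanning 100-900 with two exons.
example_gtf = [
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tHAVANA\texon\t800\t900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
saf_rows = list(gtf_exons_to_saf(example_gtf))
```

A SAF built from the transcript line instead would cover 100-900 and therefore include the intron between the two exons.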

dpryan79 commented 6 years ago

I'll have a PR for this shortly, I expect. The BED files generated from this are used elsewhere. I wonder why they're made, though, since we already have transcript BED files (genes_bed).

steffenheyne commented 6 years ago

mhmmm that's weird, I haven't done this SAF stuff... sure, exon counting is wanted.

dpryan79 commented 6 years ago

I'm accumulating some changes in my WIP PR.

dpryan79 commented 6 years ago

Anyway, I think the current code goes back to the pre-snakemake days...though I have to wonder why it was done back then too.

steffenheyne commented 6 years ago

one reason to have extra bed files was for the single cell RNA workflow, where I wanted to filter out pseudogenes based on biotype... yeah, the SAF part might come from the python workflow we once had

vivekbhr commented 6 years ago

Maybe it was done previously by Fabian back when featureCounts used SAF format as default? (although it always accepted GTF as an option, so I can't think of a clear reason to use SAF anyway)

steffenheyne commented 6 years ago

rule "annotation_bed2fasta" in filter_annotation.snakefile also seems to be a bit wrong. It is missing the "-split" and "-s" options to "bedtools getfasta". Or am I totally confused now?

This rule is used to create the input fasta file for salmon!
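A small pure-Python sketch of why those two options matter for a transcript fasta: `-split` joins only the BED12 blocks (exons), and `-s` reverse-complements minus-strand features. This imitates the documented behaviour on a made-up toy genome; it is not the bedtools implementation, and the names `revcomp` and `spliced_sequence` are hypothetical.

```python
def revcomp(seq):
    """Reverse-complement a DNA string (A/C/G/T only, case preserved)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def spliced_sequence(bed12_line, genome):
    """Return the exon-only, strand-aware sequence for one BED12 record.

    Roughly what `bedtools getfasta -split -s` is meant to produce.
    `genome` is a plain dict of chromosome name -> sequence (toy assumption).
    """
    fields = bed12_line.rstrip("\n").split("\t")
    chrom, start, strand = fields[0], int(fields[1]), fields[5]
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    offsets = [int(x) for x in fields[11].rstrip(",").split(",")]
    # Concatenate the blocks (exons) only; BED coordinates are 0-based, half-open.
    seq = "".join(
        genome[chrom][start + off : start + off + size]
        for off, size in zip(offsets, sizes)
    )
    return revcomp(seq) if strand == "-" else seq

# Toy data: two 3-bp exons separated by a 3-bp intron ("CCC").
genome = {"chr1": "AAACCCGGGTTTACGT"}
minus_tx = "chr1\t0\t9\ttx1\t0\t-\t0\t9\t0\t2\t3,3,\t0,6,"
plus_tx = "chr1\t0\t9\ttx2\t0\t+\t0\t9\t0\t2\t3,3,\t0,6,"

minus_seq = spliced_sequence(minus_tx, genome)
plus_seq = spliced_sequence(plus_tx, genome)
unspliced = genome["chr1"][0:9]  # what you get without splitting: intron included
```

Without the splitting step the sequence still contains the intron, and without strand handling minus-strand transcripts come out on the wrong strand, neither of which is what salmon should be indexing.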

steffenheyne commented 6 years ago

coming back to the OT: by using "Annotation/genes.filtered.gtf" as input to featureCounts with the default --filter_annotation option (which is empty), the workflow uses the full gtf file (like gencode m15) as given in the organism config file.
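A rough sketch of that behaviour, assuming the filter is a pattern that excludes matching records (the real filtering logic in the workflow may differ): with an empty filter, every record passes through, so the "filtered" GTF is identical to the original. `filter_gtf` and the example lines are hypothetical.

```python
import re

def filter_gtf(lines, exclude_pattern=""):
    """Drop GTF records matching exclude_pattern; keep everything if it is empty.

    Hypothetical stand-in for the annotation filter, for illustration only.
    """
    if not exclude_pattern:
        return list(lines)  # empty filter: output equals the full input GTF
    rx = re.compile(exclude_pattern)
    # Header comments are always kept; matching records are removed.
    return [l for l in lines if l.startswith("#") or not rx.search(l)]

# Made-up records with different biotypes.
gtf = [
    '#description: toy annotation',
    'chr1\tHAVANA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\tHAVANA\texon\t500\t600\t.\t-\t.\tgene_id "G2"; gene_type "processed_pseudogene";',
]

unfiltered = filter_gtf(gtf)                    # default: nothing removed
no_pseudo = filter_gtf(gtf, r"pseudogene")      # e.g. for the single cell workflow
```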

vivekbhr commented 6 years ago

I added filtered gtf as input to featurecounts in this commit