gpertea / gffread

GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction and more
MIT License
371 stars 39 forks source link

What does the -r option do in gffread? #133

Open zigge90 opened 7 months ago

zigge90 commented 7 months ago

Hi,

Can someone explain to me what the -r option does in gffread?

I have read that it: "only shows transcripts overlapping coordinate range .. (on chromosome/contig , strand if provided)"

So I thought that an additional file was needed to be provided in order for this option to work. However when I run the command without providing anything other than my annotation and my reference genome, gffread filters down from ~37 000 genes to ~19 000 genes. But I don't understand how this works since I am not providing any information about what chromosome positions to filter out.

The reason I ask is because I have an annotation that seems to have a lot of pseudo genes or isoforms that I want to get rid of. I have tried the options -V, -J and --no-pseudo but they don't really do much difference. However, when i tried the -r option just as a random test, this made a huge difference to my list of transcripts but now I don't really understand why?

This is the simple command that I ran:

gffread my_gff.gtf -r -R -F -g my_genome.fasta -y RF_prots.fa -w RF_exons.fa

Grateful for any support!

gpertea commented 7 months ago

-r should take a chromosome:start-end genomic location specification, with an optional strand prefix or suffix, e.g. -r chr1:9000-12000 should only show transcripts overlapping that genomic range on chr1 (no matter the strand), but if you only want the transcripts on the reverse strand in that region, you can write -r -chr1:9000-12000 instead. There is an example usage for that option at the bottom of this wiki page: https://github.com/gpertea/stringtie/wiki/Extracting-bundle-data-for-debugging

As for what you got there, I am afraid the way I handle the argument there is not that smart, when using -r without intentionally providing any genomic region specification, like you did, the code takes whatever token that follows there as the genomic region and attempts to parse it, but it fails silently if it could not make sense of it (it was -R in your example command line).

This is a very basic region filter and not very efficient, and as you can see it only takes one genomic region currently. You could use bedtools intersect or other specialized tools if you want, say, an intersection with a set of genomic regions specified in a BED file.