gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License
199 stars 32 forks source link

Only 1 output file, input GTF is overwritten (v0.9.9) #9

Closed malloryfreeberg closed 7 years ago

malloryfreeberg commented 7 years ago

I'm using the command below to compare a merged transcript GTF file from StringTie with annotated UCSC transcripts. gffcompare (v0.9.9) completes but only produces one output file, test_gffcompare.stats, which is empty except for 2 header lines of information about the command call. Also, my input merged_transcripts.gtf was overwritten with the correct gene names, so it appears that gffcompare is working in terms of reference annotation matching. Is overwriting the input file correct behavior? Where are the other output files (e.g. test_gffcompare.combined.gtf, test_gffcompare.loci)? I basically followed the steps outlined here, and I cloned the gffcompare repo just this morning.

Thanks :] Mallory

gffcompare -r UCSC_Main_Human_refGene.gtf -T -V -o test_gffcompare merged_transcripts.gtf

gpertea commented 7 years ago

Not sure what went on there, it should definitely NOT overwrite the input file (unless you are unlucky enough to provide an output prefix which matches the first part of your input file name and your input file has the suffix .annotated.gtf, or .combined.gtf in some cases -- see below).

So I ran the same gffcompare command you did, with a freshly cloned gffcompare and gclib, with pretty much the same parameters you showed above (I renamed my input file to merged_transcripts.gtf just to emulate what you did). The output files I got are these, as expected:

  test_gffcompare.stats
  test_gffcompare.annotated.gtf
  test_gffcompare.loci
  test_gffcompare.tracking

The only way an input file would be overwritten (with -o test_gffcompare and your other options there) is if your input file were actually named test_gffcompare.annotated.gtf.

Not sure how you got your input file overwritten in your case, perhaps there was another script running at the same time or something else going on.. Was that the exact command line you see in the test_gffcompare.stats file?

With the default parameters and a single input file gffcompare goes into "annotation" mode, which generates a <output_prefix>.annotated.gtfinstead of <output_prefix>.combined.gtf

malloryfreeberg commented 7 years ago

Below is the entire test_gffcompare.stats file:

# gffcompare v0.9.9 | Command line was:
#gffcompare -r UCSC_Main_Human_refGene.gtf -T -V -o test_gffcompare merged_transcripts.gtf
#

It also appears that gffcompare didn't actually overwrite or change the input merged_transcripts.gtf file. I used diff on an older StringTie output by mistake. However, I still don't get any additional output files. Here is the stdout output from gffcompare. There are some errors and warnings, but it looks like it is just for a subset of transcripts. The UCSC_Main_Human_refGene.gtf file contains all human genes, and the merged_transcripts.gtf is generated by StringTie from a total RNA-seq experiment. Am I missing something obvious here? Please let me know if sending you the actual input files might help diagnose the issue.

Thanks :]

gpertea commented 7 years ago

The errors I see there are quite troubling.. (they are not just the usual warnings that I see when normal annotation is being processed), so perhaps gffcompare actually crashed or exited abnormally (weird that it wasn't obvious that it did that.. truth is I haven't implemented an attitude like: "too many errors, I give up.." even though it's possible that something like that actually happened.. in the sense that there was some big error that eventually caused it to crash silently).

Indeed it would help a lot if you could share the actual input files you used (both query and reference), I'd like to see what's going on there.

malloryfreeberg commented 7 years ago

Below are the commands I used including the stringtie --merge command used before gffcompare

## Assembled transcripts using stringtie for ~1500 individual RNA-seq experiments (~1500 BAM files)
stringtie --merge -G UCSC_Main_Human_refGene.gtf *assembled_transcripts.gtf -o merged_transcripts.gtf
gffcompare -r UCSC_Main_Human_refGene.gtf -T -V -o test_gffcompare merged_transcripts.gtf

The reference gtf can be downloaded here and the transcript gtf can be downloaded here.

Many thanks for looking into this!

gpertea commented 7 years ago

Wow. Sorry about all those errors, it turns out that my always brittle GFF/GTF parsing code was choking on duplicate transcript IDs and also on gene_IDs which were identical to transcript_IDs (UCSC has this bad habit of setting gene_id to be the same with transcript_id for some transcripts, which messed up hierarchical relationships between gene->transcripts->exons in my fragile code..).

Hopefully all these problems were fixed in the gclib updates I committed (for now.. again..) so I updated the gffcompare downloads from http://ccb.jhu.edu/software/stringtie/gff.shtml#gffcompare_dl, or you can just pull again the gclib repository from github and rebuild (the necessary changes were only there, not in the gffcompare repo). Please let me know if you have any more problems with this updated code. Thanks!

malloryfreeberg commented 7 years ago

Success! Everything works :] Thank you for the fix!