huangyh09 / brie

BRIE: Bayesian Regression for Isoform Estimate in Single Cells
https://brie.readthedocs.io
Apache License 2.0
41 stars 15 forks source link

brie-event cannot read .gtf file #9

Open aarzalluz opened 6 years ago

aarzalluz commented 6 years ago

Hi,

I am trying to use BRIE using a custom annotation file and the GENCODE M10 annotation file. In both cases, running brie-event as follows:

brie-event -a gencode.vM10.annotation.gtf

returns errors and does not succeed in producing the SE files. Note that my custom annotation has the same format as the GENCODE file, and neither file causes problems when used with other software tools. The format is:

GENCODE: chr1 HAVANA gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693.1"; (...)

Custom: chr1 mm10_index exon 4777525 4777648 100 - . transcript_id "PB.1.1"; gene_id "Mrpl15"; (...)

In the case of my custom annotation, the error is the following:

 File "/home/angeles/anaconda3/bin/brie-event", line 11, in <module>
    sys.exit(main())
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/event_maker.py", line 166, in main
    sanitize=options.sanitize)
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/event_maker.py", line 59, in defineAllSplicing
    DtoA_F, AtoD_F, DtoA_R, AtoD_R = prepareSplicegraph(anno_file, ftype)
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/event_maker.py", line 38, in prepareSplicegraph
    DtoA_F, AtoD_F, DtoA_R, AtoD_R)
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/parseTables.py", line 80, in populateSplicegraph
    data =  readTable_gff(table_f, ftype)
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/parseTables.py", line 60, in readTable_gff
    exons.append([str(int(vals[3])-1), vals[4]])
UnboundLocalError: local variable 'exons' referenced before assignment

In the case of the GENCODE file, the error is slightly different, although it seems to involve the same two python scripts, event_marker.py and parseTables.py:

  File "/home/angeles/anaconda3/bin/brie-event", line 11, in <module>
    sys.exit(main())
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/event_maker.py", line 166, in main
    sanitize=options.sanitize)
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/event_maker.py", line 59, in defineAllSplicing
    DtoA_F, AtoD_F, DtoA_R, AtoD_R = prepareSplicegraph(anno_file, ftype)
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/event_maker.py", line 38, in prepareSplicegraph
    DtoA_F, AtoD_F, DtoA_R, AtoD_R)
  File "/home/angeles/anaconda3/lib/python3.6/site-packages/brie/events/parseTables.py", line 96, in populateSplicegraph
    for i in range(len(startvals) - 1):
TypeError: object of type 'map' has no len()

It seems to me that BRIE cannot get the right information from the .gtf files, and the variables are therefore empty when referenced downstream, which then causes the error... but I wonder why that is happening?

Thanks in advance,

Ángeles

huangyh09 commented 6 years ago

Hi,

Thanks for using BRIE, and reporting the issue.

It seems the reason for the errors is that the GTF files didn't match the BRIE's requirement. BRIE requires the that features of a gene are listed in a particular order, as follows.

gene
transcript
exon
exon (optional)
exon (optional)
transcript (optional)
exon
exon (optional)
exon (optional)

Could you check if your GTF files have features in such order?

Cheers, Yuanhua

aarzalluz commented 6 years ago

In the case of the custom annotation file, the features are not in that order, in fact, my .gtf file only contains the annotation info for the exons. As for the GENCODE file, the feature order seems to be the one you indicate, but other features such as CDS, start_codon, stop_codon and UTR are included as well.

Is there any way around this?

huangyh09 commented 6 years ago

For your custermised alternative exons, I guess you also have the upstream and downstream exons, right? If so, then you could create a gtf file with pseudo gene and pseudo transcripts (one including the exon, the other excuding). brie-event also gives the same format.

For the GENCODE file, the CDS, start_codon, stop_codon and UTR will be ignored, so won't affect. I haven't figured out why the above error occurs for this file. I have just downloaded M10, and brie-event works fine. BTW, which version of BRIE do you use? Is it Python 2 or 3? So I can test the specific one.

aarzalluz commented 6 years ago

It is the python3 version. I could always try to download the M10 file again and try, see if the file got corrupted somehow.

Concerning the custom annotation, I see your point. However, that would ultimately yield an expression estimate per event (exclusion and inclusion, for every spliced exon). Is there any way to use BRIE to obtain a single expression value per transcript? That is, not an estimate per exon inclusion/exclusion event, but a value for every true isoform/transcript of the gene. That is my ultimate goal, and the main reason why I'd rather not building a pseudo transcript annotation file.

huangyh09 commented 6 years ago

Thanks Ángeles for the updates. I figured out it is the incompatibility of brie-event with Python3. I will fix it and let you know.

Unfortunately, BRIE doesn't provide a full transcript quantification (actually it is an estimate as well, assigning reads to multiple competing transcripts from the same gene). There are quite a few good methods developed for bulk data. As you can find from the BRIE paper, the benefits of BRIE is that the sequence features bring good prediction to aid the splicing estimate in single cells. However, such features are hard to obtain at the transcript level, otherwise BRIE will be upgraded with the option for transcript level quantification.

aarzalluz commented 6 years ago

As you pointed out, I usually use RSEM to estimate expression at the transcript level. However, after reading the paper I was curious to try out BRIE and see whether it could provide the best of both worlds... But, as you're saying, it is not a trivial problem, considering the sparsity of single-cell data. It will be great to hear about that upgrade, if a time comes when it is possible.

Do let me know about the fix! Thanks for your help.