mortazavilab / TALON

Technology agnostic long read analysis pipeline for transcriptomes
MIT License

Problem with talon_initialize_database #140

Open Kiliankleemann opened 11 months ago

Kiliankleemann commented 11 months ago

Tried to run talon_initialize_database but got an error:

talon_initialize_database --f reference/GRCh38_GENCODE_rmsk_TE_reformatted.gtf \
  --g hg38_rmsk_ucsd \
  --a hg38 \
  --o hg38 
chrY
bulk update genes...
bulk update gene_annotations...
Traceback (most recent call last):
  File "/home/kilian/anaconda3/envs/talon/bin/talon_initialize_database", line 8, in <module>
    sys.exit(main())
  File "/home/kilian/anaconda3/envs/talon/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 1073, in main
    populate_db(db_name, annot_name, chrom_genes, chrom_transcripts, exons, genome_build)
  File "/home/kilian/anaconda3/envs/talon/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 634, in populate_db
    add_transcripts(c, transcripts, annot_name, gene_id_map, genome_build)
  File "/home/kilian/anaconda3/envs/talon/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 743, in add_transcripts
    db_gene_id = gene_id_map[native_gene_id]
KeyError: 'AluY'

Kiliankleemann commented 11 months ago

I made sure the GTF was reformatted correctly:

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
gzip -d *.gz
talon_reformat_gtf -g reference/GRCh38_GENCODE_rmsk_TE.gtf

talon_initialize_database --f reference/GRCh38_GENCODE_rmsk_TE_reformatted.gtf \
  --g hg38_rmsk_ucsd \
  --a hg38 \
  --o hg38

fairliereese commented 11 months ago

Would you be able to share the GTF that you're using with me? I will try running it on my end and see if I can pinpoint the issue.

Kiliankleemann commented 11 months ago

You should be able to download the GTF and unzip it with the first command. That's the one I tried.

fairliereese commented 11 months ago

This one? https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz This does not look like a GTF to me. For example, the strand should be in the 6th column (0-indexed), but it looks like it's in the 9th column of your file.
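The column check described above can be sketched in a few lines of Python. This is only an illustration, not part of TALON: `check_gtf_line` is a hypothetical helper, and the two sample lines are made-up values shaped like a GTF record and an rmsk.txt row. It relies on the GTF convention of nine tab-separated fields with the strand at index 6 (0-indexed).

```python
def check_gtf_line(line):
    """Return a list of problems found in one annotation line (empty if none).

    Hypothetical helper for illustration; it only checks the field count
    and the strand column, not the full GTF specification.
    """
    fields = line.rstrip("\n").split("\t")
    problems = []
    # A GTF record has exactly 9 tab-separated fields.
    if len(fields) != 9:
        problems.append(f"expected 9 tab-separated fields, found {len(fields)}")
    # The strand lives at index 6 (0-indexed) and must be '+', '-' or '.'.
    if len(fields) <= 6 or fields[6] not in {"+", "-", "."}:
        problems.append("no valid strand ('+', '-' or '.') at index 6")
    return problems


# Made-up sample lines (illustrative values only).
gtf_like = 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "G1";'
rmsk_like = ("585\t1504\t13\t4\t13\tchr1\t10000\t10468\t-248945954\t+\t"
             "(CCCTAA)n\tSimple_repeat\tSimple_repeat\t1\t471\t0\t1")

print(check_gtf_line(gtf_like))
print(check_gtf_line(rmsk_like))
```

Running a check like this over the first few lines of a file is a quick way to tell a real GTF from a UCSC table dump before passing it to talon_initialize_database.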

Kiliankleemann commented 11 months ago

Which gtf did you use for hg38 repeatmasker?

fairliereese commented 11 months ago

For me to best help you, you should send all the commands that you used to download / format your GTF. I think I'm missing some information from your side.

sojichld commented 8 months ago

I'm having a similar issue and I'm not really sure why. I've also tried using the gtf formatter with no luck.

It took 0:00:00.01 to process chromosome NW_023397527.1
Traceback (most recent call last):
  File "/users/aademilu/.local/bin/talon_initialize_database", line 8, in <module>
    sys.exit(main())
  File "/users/aademilu/.local/lib/python3.8/site-packages/talon/initialize_talon_database.py", line 1015, in main
    populate_db(db_name, annot_name, chrom_genes, chrom_transcripts, exons, genome_build)
  File "/users/aademilu/.local/lib/python3.8/site-packages/talon/initialize_talon_database.py", line 596, in populate_db
    transcripts = chrom_transcripts[chromosome]
KeyError: 'NW_023397527.1'

I've attached an example of the file. The full file can be found here. gtf_example.txt

fairliereese commented 8 months ago

Can you please send me the exact command you tried for talon_initialize_database, as well as the version number of TALON that you're using?

sojichld commented 7 months ago

> Can you please send me the exact command you tried for talon_initialize_database, as well as the version number of TALON that you're using?

talon_initialize_database --f ../../reference/GCF_004126475.2_mPhyDis1.pri.v3_genomic.gtf --a discolor_annot --g discolor --o discolor

Where can I find version information?

fairliereese commented 7 months ago

I don't think there's a nice way to access the version info now, but if you haven't updated TALON in a long time it might be worth pulling and installing the latest commits. On my machine, I am able to run your init command with gtf_example.txt no problem. Did you also verify that you're having an issue with the small file too?
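One possible way to look up the installed version from Python is to query the package metadata. This is a minimal sketch, assuming TALON was installed with pip under the distribution name `talon` (an assumption, not confirmed in this thread). `installed_version` is a hypothetical helper, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(dist_name="talon"):
    """Return the installed version string of a distribution, or None.

    The default name "talon" is an assumption about how the package
    was installed; pass a different name if yours differs.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed under this name.
        return None


print(installed_version())
```

The reported string reflects the version recorded at install time, so an install from an older checkout may not distinguish between individual commits.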

sojichld commented 7 months ago

Yes. That one does run for me as well (it doesn't include NW_023397527.1), but I cannot get other cuts of the file to work. They produce the following error:

    genes, transcripts, exons = read_gtf_file(gtf_file)
  File "/users/aademilu/.local/lib/python3.8/site-packages/talon/initialize_talon_database.py", line 495, in read_gtf_file
    entry_type = tab_fields[2]

I noticed that this gene is the only one on that scaffold; maybe that could be the issue? I have provided the full file, which runs until it reaches the scaffold in question. The program runs to completion if I remove the gene from the GTF.

fairliereese commented 7 months ago

The problem is that the gene does not have any transcripts annotated to it. If you look, the file goes from one gene entry (the one on your NW_023397527.1 chromosome) to the next, without any additional entries in between. I would advise removing this entry from your GTF and moving on with your analysis.
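Gene entries without any transcript can be found ahead of time with a short script. The sketch below is not part of TALON: `genes_without_transcripts` is a hypothetical helper, and it assumes a standard GTF where the feature type is in the third field and the attributes field contains `gene_id "..."`. The sample records are made up for illustration.

```python
import re

# Matches the gene_id attribute in the 9th GTF field.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def genes_without_transcripts(lines):
    """Return gene_ids that have a 'gene' entry but no 'transcript' entry."""
    genes = []                 # gene_ids in order of first appearance
    seen_genes = set()
    with_transcripts = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue           # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue           # not a complete GTF record
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if fields[2] == "gene" and gene_id not in seen_genes:
            seen_genes.add(gene_id)
            genes.append(gene_id)
        elif fields[2] == "transcript":
            with_transcripts.add(gene_id)
    return [g for g in genes if g not in with_transcripts]


# Made-up example: gene G2 has no transcript entry.
example = [
    'chr1\tsrc\tgene\t1\t100\t.\t+\t.\tgene_id "G1";',
    'chr1\tsrc\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'scaf1\tsrc\tgene\t5\t50\t.\t-\t.\tgene_id "G2";',
]
print(genes_without_transcripts(example))
```

For a real file, pass an open file handle instead of the list, for example `genes_without_transcripts(open("annotation.gtf"))` with your own path, and remove or fix the reported entries before initializing the database.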

sojichld commented 7 months ago

You're right. Strange. I've removed it and it works just fine. I was able to fully process everything. Thanks.