hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License
148 stars 23 forks source link

About extracting CDS and PEP from GFF files obtained from TOGA #164

Open SWei2333 opened 3 months ago

SWei2333 commented 3 months ago

Hi Recently, I've been using annotation files for some species obtained via the TOGA method from Zoonomia. When I attempted to extract CDS and PEP from these annotation files and genomes, I found that more than half of the CDS sequences extracted using itools were not multiples of 3. Upon inspecting these sequences, I discovered that half of them had incomplete stop codons, with only one or two bases remaining. The other half ended with TGA but were still not multiples of 3. The reason for this is currently unknown.

Do you have any suggestions for this issue? Thank you very much.

Best wishes

MichaelHiller commented 3 months ago

Can you pls extract only projections (transcript.gene.chainID) that are classified as intact or partially intact? This info is in loss_summ_data.tsv.gz (https://genome.senckenberg.de/download/TOGA/README.txt).

Other projections have frameshifts or stop codons or missing sequences and this may not translate.

SWei2333 commented 3 months ago

Hi I compared the genes with not multiples of 3 to the loss_summ_data.tsv.gz file and found that a significant portion of the genes marked as "I" still have CDS sequences that are not multiples of 3, primarily due to incomplete stop codons.

MichaelHiller commented 3 months ago

Can you send an example? We have all the data also in the genome browser http://genome.senckenberg.de/ Pls type a projection into the browser of the query species and click on the TOGA annotation. This shows the protein alignment and a list of inactivating mutations.

Send me one, I can have a look as well

SWei2333 commented 3 months ago

Thank you very much for your suggestions. While organizing my files, I suddenly realized a potential issue. The GTF files I downloaded from Zoonomia contain separate lines for start_codon and stop_codon. However, when converting them to GFF format using gffread, an error occurred, causing some bases of the stop codons to not be included in the CDS regions, resulting in many sequences not being multiples of 3. Thank you very much for your help. I will correct this error and try again. If the problem persists, I will seek your advice again. Thank you very much.

SWei2333 commented 3 months ago

Hi, I recently downloaded the genomes and corresponding TOGA.gtf files for several species from Zoonomia. I used gffread to extract the protein sequences (pep) from the GTF files and performed BUSCO scoring. I found that the scores were generally around 80%. Could the issue be with the way I extracted the protein sequences, or is it due to the inherent quality of the genomes resulting in GTF files of this level? below is a result

C:76.1%[S:44.2%,D:31.9%],F:6.6%,M:17.3%,n:9226 | |7020 Complete BUSCOs (C) | |4078 Complete and single-copy BUSCOs (S) | |2942 Complete and duplicated BUSCOs (D) | |613 Fragmented BUSCOs (F) | |1593 Missing BUSCOs (M) | |9226 Total BUSCO groups searched

Thank you very much.

MichaelHiller commented 3 months ago

For which assembly are these stats? And how does that compare to the BUSCO stats we have in the supplement tables?

SWei2333 commented 3 months ago

I downloaded the GTF file for Dinomys branickii using the mouse as the reference. In your table, the annotation of this species with the human as the reference has a BUSCO score of 9,025 97.82% 86 115(Complete (s + d) Complete% Fragmented Missing)

MichaelHiller commented 2 months ago

This is the HLdinBra1 DNAzoo assembly, right? Of course, mouse or human as the reference will make a difference, but can't explain 97.8% vs. 76% completeness.

Can you pls download the protein alignment file and extract only QUERY sequences and BUSCO this? https://genome.senckenberg.de/download/TOGA/human_hg38_reference/Rodentia/Dinomys_branickii__pacarana__HLdinBra1/proteinAlignments.fa.gz This should reproduce the BUSCO values we report.

SWei2333 commented 2 months ago

I used this file for the BUSCO analysis and obtained the same results as you did. However, I'm a bit confused about why the BUSCO scores decrease when I extract the peptide sequences (PEP) from the GTF file. Could there be an issue with my method of extracting the PEP? I have used gffread as well as itools.

MichaelHiller commented 2 months ago

Likely TOGA masks frameshifts and internal stop codons, while your GTF to protein may not do that. This could explain it. Can you look at BUSCO genes that are complete in our protein file but not "GTF to protein"?

SWei2333 commented 2 months ago

Hello,

In the past few days, I have tested a new species that I assembled myself. I ran the make-lastz-chain and TOGA processes on this genome, which has a size of 2.7G and is well-masked for repeat sequences. When TOGA completed, I extracted the query peptides from the prot.fasta file, but the number of sequences was only 1933. This seems strange. My reference genome is mm10, as provided in TOGA. I will provide my lastz log file and TOGA log file. Could you please give me some advice? Thank you very much. 图片1 [Uploading toga_2024_06_25_at_14_54.log…]() [Uploading run.log…]()

SWei2333 commented 2 months ago

lastz log : https://drive.google.com/file/d/1yAay3gBEw-_IXPKjCe54rWOk17Xw7is8/view?usp=sharing toga log : https://drive.google.com/file/d/1h0ry0-oc4C2pioiCjZos5Ouzu1zlYUZ1/view?usp=sharing

MichaelHiller commented 2 months ago

20000 genes are 1:0. This is the main problem.

Can you visualize your chains in the UCSC browser? E.g. convert to bigChain and load as a custom track? How large is the gziped chain file. It must be a few hundred MB. I assume the chains are incomplete.

SWei2333 commented 2 months ago

Hi, The gziped chain file is 112M, I think the size is enough, I'll try to visualize it.

SWei2333 commented 2 months ago

Hi,

I have tested the LASTZ alignment for the species twice, and it seems there are no errors. The file sizes are the same, but I'm still confused about where the issue might be. I also couldn't identify any problems using the UCSC browser. Could you provide some advice?

Thank you!

MichaelHiller commented 2 months ago

Can you pls point me to the query assembly you used? I can try to produce the chains here. Then you can compare it.

SWei2333 commented 2 months ago

Thank you for your help! https://drive.google.com/file/d/1DdAbWqeiowVAwY3svxh1Q_c2Mg1OehcY/view?usp=sharing Above is my 2bit file.

MichaelHiller commented 2 months ago

Thx. Is the query actually available on NCBI or DNAzoo ? Or is this a private assembly? Is hg38 your reference?

SWei2333 commented 2 months ago

Sorry, it's a private assembly, and the reference is mm10

MichaelHiller commented 2 months ago

OK, then I need to know the taxID of the query. And did you already run RepeatModeler + RepeatMasker (softmasking) on the query?

SWei2333 commented 2 months ago

Sure, I have run RepeatModeler + RepeatMasker+Repeatproteinmasker+TRF to mask the genome.(soft mask). The query is assemblied by myself and the taxid is 423607. Thank you for your help!

MichaelHiller commented 2 months ago

OK, I started the chains. Will take a few days as our cluster is pretty full

MichaelHiller commented 1 month ago

Everything finished on our side. I think the chains have a similar size to yours. Pls download everything from https://genome.senckenberg.de/download/HLfukMec1/ and send me a ping. I'll delete it then.

Here is the status of the genes: g GENE loss_summ_data.tsv | cut -f3 | s | uniq -c 17755 I 2252 L 344 M 263 PG 108 PI 59 PM 999 UL

Looks good.

I made a quick hack to transfer our ancestral gene set from hg38 to mm10. Compared to HLfukDam2 = GCF_012274545.1 your assembly is perfectly fine. intact genes genes with inactivating mutations genes with missing sequence vs_HLfukMec1 15019 1090 124 vs_HLfukDam2 11690 1080 3463

SWei2333 commented 1 month ago

Thank you very much for your help. I have downloaded all the files. I will compare our log files to see where the issue with my TOGA might be.

Michael Hiller @.***> 于2024年7月7日周日 03:40写道:

Everything finished on our side. I think the chains have a similar size to yours. Pls download everything from https://genome.senckenberg.de/download/HLfukMec1/ and send me a ping. I'll delete it then.

Here is the status of the genes: g GENE loss_summ_data.tsv | cut -f3 | s | uniq -c 17755 I 2252 L 344 M 263 PG 108 PI 59 PM 999 UL

Looks good.

I made a quick hack to transfer our ancestral gene set from hg38 to mm10. Compared to HLfukDam2 = GCF_012274545.1 your assembly is perfectly fine. intact genes genes with inactivating mutations genes with missing sequence vs_HLfukMec1 15019 1090 124 vs_HLfukDam2 11690 1080 3463

— Reply to this email directly, view it on GitHub https://github.com/hillerlab/TOGA/issues/164#issuecomment-2211900888, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIAZCEZIXDAFZ357BX4Z6PLZLBB3BAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJRHEYDAOBYHA . You are receiving this because you authored the thread.Message ID: @.***>

SWei2333 commented 1 month ago

Hi, I looked closely at the log files and found a big difference in the output, but I'm not sure why. The differences I can think of might be the following,

  1. The reference files I used were provided by toga's github repository, including mm10.2bit,toga /TOGAInput/mouse_mm10/toga.transcripts.bed, toga.isoforms.tsv, toga. U12introns.tsv
  2. The version of TOGA used was obtained from git clone directly
  3. Use the local runtime instead of slurm

My command is "./toga.py mm10.Fmechowii.final.chain.gz /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed /home/TOGA/mm10.2bit Fmechowii.2bit --kt --pn ./output/Fmechowii -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --nd ./output/Nextflow.log --cb 12,15,18,21,24,27,30,33,36,39,42,45,47,49,60,80,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms"

MichaelHiller commented 1 month ago

The parameters look good. There will be very few genes that require more than your maximum of 100 GB. No idea why the results are different. Did you re-run TOGA with the chains that I computed. Then we can at least exclude a chain problem.

SWei2333 commented 1 month ago

Thank you for your advice, I'll try again!

Michael Hiller @.***> 于2024年7月10日周三 04:06写道:

The parameters look good. There will be very few genes that require more than your maximum of 100 GB. No idea why the results are different. Did you re-run TOGA with the chains that I computed. Then we can at least exclude a chain problem.

— Reply to this email directly, view it on GitHub https://github.com/hillerlab/TOGA/issues/164#issuecomment-2218623215, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIAZCEZGCVFSEERW6NMK5Q3ZLQ7EZAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJYGYZDGMRRGU . You are receiving this because you authored the thread.Message ID: @.***>

SWei2333 commented 1 month ago

Hello, I used your chain file, and the result did improve a bit, and the protein sequence on the reference alignment can be increased, but there are still as many as 27565 transcripts annotated as one2zero, so I think there is still a problem with TOGA, because their log files are quite different. Can you advise on this, or could you please provide the version of TOGA and command you are running.

Michael Hiller @.***> 于2024年7月10日周三 04:06写道:

The parameters look good. There will be very few genes that require more than your maximum of 100 GB. No idea why the results are different. Did you re-run TOGA with the chains that I computed. Then we can at least exclude a chain problem.

— Reply to this email directly, view it on GitHub https://github.com/hillerlab/TOGA/issues/164#issuecomment-2218623215, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIAZCEZGCVFSEERW6NMK5Q3ZLQ7EZAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMJYGYZDGMRRGU . You are receiving this because you authored the thread.Message ID: @.***>

MichaelHiller commented 1 month ago

We are running 1.1.4 (the latest public version). Which version do you run? I'll ask Yury to have a look

SWei2333 commented 1 month ago

Hello, the version I am using is the latest version 1.1.7. I think the difference in the log file may be the cause of the problem. There should not be such a big difference when running in local or slurm mode. Could you please compare and make comments?

Michael Hiller @.***> 于2024年7月15日周一 18:50写道:

We are running 1.1.4 (the latest public version). Which version do you run? I'll ask Yury to have a look

— Reply to this email directly, view it on GitHub https://github.com/hillerlab/TOGA/issues/164#issuecomment-2228214159, or unsubscribe https://github.com/notifications/unsubscribe-auth/BIAZCE5EQDP54UFYHUSZQY3ZMOSPFAVCNFSM6AAAAABIVUVZMOVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRYGIYTIMJVHE . You are receiving this because you authored the thread.Message ID: @.***>

ReverendCasy commented 1 month ago

Hello, For some reason I cannot download the running log, could you please send it to me? Also, if it is there, could you also send me temp/_cesar_crashed_jobs.txt file?

UPD: Sorry, found the right link to the log. From what I see, 86 output files are empty, likely due to some threads failing with the “OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable” error. This is likely a problem on your cluster’s side (see, e.g., https://stackoverflow.com/questions/52026652/openblas-blas-thread-init-pthread-create-resource-temporarily-unavailable). Anyway, I wonder why TOGA did not catch the error properly. Please let us know if fixing the cluster setup helps. In the meanwhile we will check the crashed job handling.

Hello, the version I am using is the latest version 1.1.7. I think the difference in the log file may be the cause of the problem. There should not be such a big difference when running in local or slurm mode. Could you please compare and make comments? Michael Hiller @.***> 于2024年7月15日周一 18:50写道:

molinfzlvvv commented 1 month ago

Hi, @ReverendCasy

I'm the partner of the questioner. Based on your suggestion, I think this error may be due to a user process limitation issue. However, I can only run it in local mode, so I tested it twice, v3 and v4, respectively, reducing the number of buckets to reduce the number of processes created (--cb). Both times, the operation ended successfully, but there was still this problem in the log file, and the running result was quite different from that of yours. In addition, attach my "_cesar_crashed_jobs. txt file".

v3_cesar_crashed_jobs.txt v4_cesar_crashed_jobs.txt result comparsion.docx

To my surprise, v4 has the lowest number of buckets, so it should have fewer problems exceeding the user limit and better results. But looking from the result the v3 result is better, I am still can't understand what reason be.

v3 command: ./toga.py mm10.HLfukMec1.allfilled.chain /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed mm10.2bit Fmechowii.2bit --kt --pn /opt/synData2/gene_loss/Fmev3 -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,20,30,40,50,60,70,80,90,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms

v4 command: ./toga.py mm10.HLfukMec1.allfilled.chain /home/TOGA/TOGAInput/mouse_mm10/toga.transcripts.bed mm10.2bit Fmechowii.2bit --kt --pn /opt/synData2/gene_loss/Fmev4 -i /home/TOGA/TOGAInput/mouse_mm10/toga.isoforms.tsv --nc /home/TOGA/nextflow_config_files --cb 10,30,50,70,90,100 --cjn 300 --u12 /home/TOGA/TOGAInput/mouse_mm10/toga.U12introns.tsv --ms

Thank you for your time to answer for me. I hope you can give me some suggestions. Looking forward to your reply.

ReverendCasy commented 1 month ago

Hi @molinfzlvvv , Thank you for the test. The number of jobs is controlled with the --cjn option; could you please run any of these commands with cjn set to, say, 150?

molinfzlvvv commented 1 month ago

Hi @ReverendCasy,

As you suggested, I set the --cjn option to 150, but the results did not improve much. After carefully comparing the differences, I found that the only change was in the executor configuration. Specifically, I changed the executor option for all processes in the nextflow_config_files folder from 'slurm' to 'local'. But I still don't understand why there are so many differences in the log file content. I am using the latest version, 1.1.7; is it the same version you are using? Can you offer your opinion on this matter?

image

Currently, I am struggling to execute TOGA in its entirety, as I frequently encounter tasks being interrupted or skipped. This poses a significant challenge. Looking forward to your answer.

ReverendCasy commented 1 month ago

Hi @molinfzlvvv , Sorry for the late reply. Could I also get output logs for the last run?

I am using the latest version, 1.1.7; is it the same version you are using?

Yep

Can you offer your opinion on this matter?

Right now I assume that the problems arise from either the cluster settings or memory limitations. One thing you can try is repartitioning memory buckets (say, --cb 3,7,10,15,20,30,40,50) to get rid of the heaviest jobs. Also, since reducing the absolute number of CESAR jobs helped a bit in the last run, you could also try combining lighter memory buckets and smaller --cjn number. Hope that helps.

molinfzlvvv commented 1 month ago

Hi,@ReverendCasy

Thank you very much for your repeated assistance. I will upload the latest log soon. Your suggestion might be effective, but could it potentially lead to the loss of alignments for genes that occupy a lot of memory?

log.zip

ReverendCasy commented 1 month ago

Hi @molinfzlvvv ,

but could it potentially lead to the loss of alignments for genes that occupy a lot of memory?

True, and for now I would like to make sure that the problem does no concern insufficient memory resources on your machine and/or erroneous job binning in TOGA.

P.S. Have you or @SWei2333 contacted the cluster administrator regarding the user process limit? Might be the most obvious solution, given the OpenBLAS error spotted in the previous run.

molinfzlvvv commented 4 weeks ago

Hi,@ReverendCasy

Thanks to your advice, I tried to fix the error about OpenBLAS. Fortunately, TOGA worked well after I reduced the OPENBLAS_NUM_THREADS parameter, and I think it was solid from the results.

image

I sincerely thank you for your patience and assistance.

ReverendCasy commented 4 weeks ago

Hi @molinfzlvvv , You’re welcome. The results look much better indeed. Feel free to comment here if you spot any other inconsistencies in your results.

molinfzlvvv commented 3 weeks ago

Hi @ReverendCasy,

As an additional problem, I have a species that has a single point mutation in its genome sequence, but one of the transcripts has a premature stop codon due to exon skipping and so on. My current vision is to do an assembly of the transcriptome data and TOGA with the reference genome to identify mutations in different transcripts, Is this theoretically possible?

ReverendCasy commented 3 weeks ago

Hi @molinfzlvvv Sorry, not sure if I follow. Do you suspect the premature stop codon to be a false positive call? Are there any other reference transcripts for the same gene from which you can recover the point mutation? Do you want to use transcriptomic data for verifying mutations predicted by TOGA or to run TOGA with transcriptome assembly as query? The latter will likely lead to skewed orthology predictions given the lost synteny data. P.S. Since the last questions do not relate to the original issue, please feel free to mail me directly (yury.malovichko@senckenberg.de)