hillerlab / TOGA

TOGA (Tool to infer Orthologs from Genome Alignments): implements a novel paradigm to infer orthologous genes. TOGA integrates gene annotation, inferring orthologs and classifying genes as intact or lost.
MIT License
165 stars 22 forks source link

HAL alignments and chaining: post processing TOGA outputs #143

Open vinitamehlawat opened 10 months ago

vinitamehlawat commented 10 months ago

Hi Dr. @MichaelHiller

I have couple of queries regarding my TOGA outputs. I am working with vertebrates genomes (size ranging from 550Mb to 1Gb, all genomes that I considered are of very good quality genomes, BUSCO > 90%). I wanted to explore gene loss among my selected group of species (PS: attached Phylogenetic fig of my data). Currently I considered only one species as reference, which I marked as Ref in tree. I wanted to see how many genes and what genes are lost in all my query genomes.

For this I used HAL alignment because your pipeline make_lastz_chains is NOT working with test data https://github.com/hillerlab/make_lastz_chains/issues/46 and https://github.com/hillerlab/make_lastz_chains/issues/45. I also raised issue there but I guess because of busy schedule they are not yet resolved.

To convert Hal into chain I used following command:

  1. hal2fasta and then faToToBit for .2bit
  2. halStats to .bed sequences; for all query genomes to get .psl
  3. halLiftover to get .psl between ref and target
  4. axtChain to get the .chain alignment between each genome.

This hal to chain process took 3-4 hrs for most of the genomes but for some it took 6-7 hrs.

Further I used these .chain files to run the TOGA, because my reference genome is very well annotated and with isoform file I got list of genes which are lost and other category as well in loss_summ_data.tsv.

I varified with some already published genome that found some genes lost for my focal/query species and I also got these genes as 'lost' in my loss_summ file.

Some questions regarding my outputs:

  1. Is this method correct to get the gene loss (I mean using Hal to chain)
  2. how to arrange the pairwise comparison – so that we can logically determine the actual “loss” in the “query”, and rule out the “gain” in the “ref”.
  3. For example in one query genome I got 1000 gene lost but out of 1000 loss are 453 and rest are "novel gens" in Ref species. Out of these 453 lots of genes are have identical gene descriptions but still have different gene ids

I would really appreciate any suggestion from your end.

Thank you very much

Slide11

MichaelHiller commented 10 months ago

Hi,

sorry to hear that the lastz pipe is not working. I asked @kirilenkobm to have a look, as I don't know this code too well. --> Something I always ask is whether you repeatModeled and repeatMasked all genomes individually. This is a common problem, resulting in long long runtimes. But I guess your problem is something else.

In principle, aligning genomes <1 GB should be no problem.

To the questions. 1) I have no experience with hal to chain as we have never tried that. Maybe you can visualize these chains in a browser to see if they align genes well, not only coding exons but also introns and intergenic sequences.

2) You need to project the hal to your reference and then extract pairwise chains for this reference to all queries. To distinguish the loss of an ancestral gene in query from a lineage-specific gene annotated in the reference (I guess this is what you mean), you need an outgroup that has the gene. E.g. if query 3 has geneX as intact and query 1 or 2 have this as a loss, then we can assume that X is an ancestral gene lost in 1 or 2.

3) Not sure I understand, but hopefully 2) answers this.

Best Michael

kirilenkobm commented 10 months ago

Hi @vinitamehlawat

I am sorry for the delay - a bit overwhelmed with my primary job. However, could you try the older version of the make chains pipeline?

Please let me know if it works for you. Theoretically, the flow must be pretty much the same in v1 and v2, but seems like it's not...

vinitamehlawat commented 10 months ago

Thank you @MichaelHiller ,

Your answers really helped me to understand some of my questions. Yes, my every genome is softmasked, and I also did the HAL projection with my Ref genome and then later on I started the chaining with other query genomes.

I will also use the make_lastz_chains to make sure the accuracy in my chainings and I will write back to you again.

vinitamehlawat commented 10 months ago

Hi @kirilenkobm , Thank you for your response

As per your suggestion, I tried the older version of the make chains pipeline, but on my HPC-server somehow this version is not able to install UCSC-tools (which are conda installation). when I did ./install_dependencies.py It gave me something like`:

Acquiring packages necessary to run Operating system: Linux Found conda installation at /share/apps/python/anaconda-3.8/bin/conda !Warning! Docker is not installed or is not reachable Conda available: True; Docker available: False

Acquiring axtChain

Conda available: trying this channel... Collecting package metadata (current_repodata.json): done Solving environment: / The environment is inconsistent, please check the package plan carefully The following packages are causing the inconsistency:

In the latest version, I did not face any installation issues. If you can suggest any alternative to installing these dependencies, that would be a big help.

vinitamehlawat commented 9 months ago

Hi @kirilenkobm

Greetings, as per your suggestion I tried the older version of your pipeline. I successfully installed all the dependencies and ran test data but got following error.

[d1/4aa5ff] NOTE: Process execute_jobs (334) terminated with an error exit status (127) -- Execution is retried (2) [b5/4f2f5d] NOTE: Process execute_jobs (332) terminated with an error exit status (127) -- Execution is retried (2) [9c/4a16b7] NOTE: Process execute_jobs (625) terminated with an error exit status (127) -- Execution is retried (3) [c7/851fa8] NOTE: Process execute_jobs (678) terminated with an error exit status (127) -- Execution is retried (2) [ea/b72980] NOTE: Process execute_jobs (719) terminated with an error exit status (127) -- Execution is retried (2) Error executing process > 'execute_jobs (67)'

Caused by: Process execute_jobs (67) terminated with an error exit status (127)

Command executed:

/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/runRepeatFiller.sh /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/jobs/infillChain_158

Command exit status: 127

Command output: ..calling RepeatFiller:

Command error: /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/runRepeatFiller.sh: line 12: --workdir: command not found

Work dir: /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/fillChain_targetquery/work/21/08e1e7d44488e6d8c672adaff8864f

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh

/home/vlamba/python3.14/lib/python3.9/site-packages/py_nf/py_nf.py:404: UserWarning: Nextflow pipeline fillChain_targetquery failed! Execute function returns 1. warnings.warn(msg) Uncaught exception from user code: Command failed: /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/doFillChain.sh HgAutomate::run('/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/...') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/HgRemoteScript.pm line 117 HgRemoteScript::execute('HgRemoteScript=HASH(0xc3cb78)') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 735 main::doFillChains() called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/HgStepManager.pm line 169 HgStepManager::execute('HgStepManager=HASH(0xc39a18)') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 877 Error!!! Output file /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/target.query.allfilled.chain.gz not found! The pipeline crashed. Please contact developers by creating an issue at: https://github.com/hillerlab/make_lastz_chains

vinitamehlawat commented 9 months ago

Hello Dr. Hiller and @kirilenkobm

First I wanted to inform you both that make_lastz worked for me. As @kirilenkobm suggested me to first run makelastz using one chromosome and I tried and it worked for me. And again when I ran makelastz for whole genome I got chain alignment for my genome data.

Second I wanted to update you regarding my comparison between the orthology inference of HAL and makelastz alignment for toga:

HAL-chaining orthology.tsv

Sp A: orthology_mapping: Extracted 21518 orthology components in total orthology_mapping: Orthology class sizes:

When I compared the gene loss between both tools in makelastz I got comparetively less number; For example for this same Sp using HAL alignment I got 4289 gene loss and using makelastz I got lost gene number 3788.

I am going to use makelastz further for my data.

Thank you Vinita

MichaelHiller commented 8 months ago

Hi Vinita, thanks for posting these results. Overall I would say that hal-chains also seem to work, especially because a hal alignment can be used for several different reference species.

What I find interesting is that hal-chains have more 1:1 and more 1:0, which is unexpected. If you have the chance to visualize both types of chains (UCSC bigChain), how do 1:1 or 1:0 genes look like that differ between hal- and lastz-chains? And if you have a decent annotation of your query genome, how many of these differential TOGA annotations are supported by the annotation?

We have now also succeeded in running Cactus and will make a few tests for mammals in the next weeks.

Thx

vinitamehlawat commented 8 months ago

Thank you Dr. Hiller for pointing out orthology differences.

Yes, I do have annotations for my 2 query genomes, and I tried to map the query_annotation.bed, which was the outcome from TOGA. But that gave me the error that your annotations do not match. I realised that the gene IDs in the query annotation.bed file are still from the reference genome.

I am sorry, I am just at the very basics of bioinformatics, and I don't know how to map these annotations to query genomes or manipulate the such complex data on whole genome level Right now I am so overwhelmed with gene loss data, so I haven't thought to give it a try on the whole genome. I have follow-up queries based on this:

  1. How do convert all the annotations in query so that I can map over genome to visualise the differences between HAL & Makelastz annotations? (it would be a great help you can provide any discussion link or script to do that)
  2. I tried to run the TOGA_assembly_stats.py to see for each qury genome how bar plot look like Fig1 (In paper " Loss of a gluconeogenic muscle enzyme contributed to adaptive metabolic traits in hummingbirds"). Itried with both genome file and 2bit file but that didn't worked for me
  3. If I have lost gene IDs for all my query species i.e this gene is lost in all my query species , How to grep sequences for those gene from either codon.fasta or prot.fasta . For this I alo referenced one of your papers, "Six ref quality genomes of Bats" where you showed the mutations on exons blocks and then put the alignment below them
  4. I also found that if genome is at scaffold level and for same species genome is at Chr level there is big number of difference of missing gene; Should I consider only high quality of genome in my data ?
  5. Finally, I am stuck at Dollo parsimony step; How you put the gene loss number on Phylogeny; I mean how to made the decision that If gene is lost in your focal queries and thenn you looked at function/evolutionary adaptation reason. For example: In my data (Three outgroup & 14 ingroup) I found one gene is Intact in my all three outgroup sister Species but the Same gene is "Lost" in my 13 query species and in 14th Query this gene is "Missing" so I assume that this gene is lost at the ancestroal branch of all my ingroup species. Which tool you used to do that or how you look for each gene ID in your huge mammal data paper?

Apologies for a lot of the queries, but trust me, your suggestions, hints, or links to any discussion will help me take a rigth path to analyse this data further.

Best regards Vinita

MichaelHiller commented 8 months ago

Hi Vinita, sorry for the slow response. TOGA keeps the gene and transcript IDs of the reference in query_annotation, as it is homology based. This tells you which genes TOGA thinks is present at the query locus.

To the questions 1) We will play around with the cactus alis we have now and will check for differences. More when we have results. 2) For each query, you need a correct TOGA run, which gives a vs_$query directory that has the loss_summ and other files. This is what the script needs. 2bit is just needed for the actual TOGA run. 3) The information on inactivating mutations is contained in another file called inact_mut_data.txt. 4) With scaffold level, do you mean a genome with more assembly gaps? If so, yes, TOGA will find more genes with missing pieces in such an assembly. TOGA can work with less complete or fragmented assemblies, as our transcript classification takes this into account. Blast would just tell you that an exon or gene is not found, but it does not tell you whether this is because there is an assembly gap (missing) or a real exon deletion. 5) This example describes a gene that was likely lost in the ancestral branch leading to the clade with the 14 ingroups. We typically just screen for genes that are lost in particular species (like the hummingbirds) but present in most or all others.

vinitamehlawat commented 7 months ago

Thank you very much Dr. Hiller for answering my all queries. Looking forward to see if there is significant differences in orthology between Cactus and make_lastz.