vinitamehlawat opened this issue 10 months ago
Hi,
sorry to hear that the lastz pipeline is not working. I asked @kirilenkobm to have a look, as I don't know this code too well. Something I always ask is whether you repeat-modeled and repeat-masked all genomes individually; not doing so is a common problem and results in very long runtimes. But I guess your problem is something else.
In principle, aligning genomes <1 GB should be no problem.
To the questions. 1) I have no experience with hal to chain as we have never tried that. Maybe you can visualize these chains in a browser to see if they align genes well, not only coding exons but also introns and intergenic sequences.
2) You need to project the hal to your reference and then extract pairwise chains for this reference to all queries. To distinguish the loss of an ancestral gene in query from a lineage-specific gene annotated in the reference (I guess this is what you mean), you need an outgroup that has the gene. E.g. if query 3 has geneX as intact and query 1 or 2 have this as a loss, then we can assume that X is an ancestral gene lost in 1 or 2.
3) Not sure I understand, but hopefully 2) answers this.
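The outgroup rule in 2) could be sketched as a small function (a sketch only; the status codes "I"/"L" are assumed to follow TOGA's intact/lost classification):

```python
# Sketch of the outgroup rule: a loss in a query is ancestral only if
# at least one outgroup still carries the gene as intact.
# Status codes assumed: "I" = intact, "L" = lost.
def is_ancestral_loss(query_status, outgroup_statuses):
    return query_status == "L" and "I" in outgroup_statuses

# geneX: intact in query 3 (used as outgroup), lost in query 1
print(is_ancestral_loss("L", ["I"]))        # True: ancestral gene lost in query 1
# geneY: not intact in any outgroup -> cannot call it an ancestral loss
print(is_ancestral_loss("L", ["L", "L"]))   # False
```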
Best Michael
Hi @vinitamehlawat
I am sorry for the delay - a bit overwhelmed with my primary job. However, could you try the older version of the make chains pipeline?
Please let me know if it works for you. Theoretically, the flow must be pretty much the same in v1 and v2, but seems like it's not...
Thank you @MichaelHiller ,
Your answers really helped me understand some of my questions. Yes, every one of my genomes is softmasked, and I also did the HAL projection with my reference genome and then started the chaining with the other query genomes.
I will also use make_lastz_chains to check the accuracy of my chains, and I will write back to you again.
Hi @kirilenkobm , Thank you for your response
As per your suggestion, I tried the older version of the make chains pipeline, but on my HPC server this version is somehow not able to install the UCSC tools (which are a conda installation).
When I ran `./install_dependencies.py`, it gave me something like:

    Acquiring packages necessary to run
    Operating system: Linux
    Found conda installation at /share/apps/python/anaconda-3.8/bin/conda
    !Warning! Docker is not installed or is not reachable
    Conda available: True; Docker available: False
    Conda available: trying this channel...
    Collecting package metadata (current_repodata.json): done
    Solving environment: /
    The environment is inconsistent, please check the package plan carefully
    The following packages are causing the inconsistency:
In the latest version, I did not face any installation issues. If you can suggest any alternative to installing these dependencies, that would be a big help.
Hi @kirilenkobm
Greetings! As per your suggestion, I tried the older version of your pipeline. I successfully installed all the dependencies and ran the test data, but got the following error:
    [d1/4aa5ff] NOTE: Process execute_jobs (334) terminated with an error exit status (127) -- Execution is retried (2)
    [b5/4f2f5d] NOTE: Process execute_jobs (332) terminated with an error exit status (127) -- Execution is retried (2)
    [9c/4a16b7] NOTE: Process execute_jobs (625) terminated with an error exit status (127) -- Execution is retried (3)
    [c7/851fa8] NOTE: Process execute_jobs (678) terminated with an error exit status (127) -- Execution is retried (2)
    [ea/b72980] NOTE: Process execute_jobs (719) terminated with an error exit status (127) -- Execution is retried (2)
    Error executing process > 'execute_jobs (67)'
    Caused by:
      Process execute_jobs (67) terminated with an error exit status (127)
    Command executed:
      /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/runRepeatFiller.sh /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/jobs/infillChain_158
    Command exit status:
      127
    Command output:
      ..calling RepeatFiller:
    Command error:
      /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/runRepeatFiller.sh: line 12: --workdir: command not found
    Work dir:
      /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/fillChain_targetquery/work/21/08e1e7d44488e6d8c672adaff8864f
    Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named .command.sh
    /home/vlamba/python3.14/lib/python3.9/site-packages/py_nf/py_nf.py:404: UserWarning: Nextflow pipeline fillChain_targetquery failed! Execute function returns 1.
      warnings.warn(msg)
    Uncaught exception from user code:
      Command failed: /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/TEMP_run.fillChain/doFillChain.sh
      HgAutomate::run('/scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/...') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/HgRemoteScript.pm line 117
      HgRemoteScript::execute('HgRemoteScript=HASH(0xc3cb78)') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 735
      main::doFillChains() called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/HgStepManager.pm line 169
      HgStepManager::execute('HgStepManager=HASH(0xc39a18)') called at /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/doLastzChains/doLastzChain.pl line 877
    Error!!! Output file /scrfs/storage/vlamba/home/make_lastz_chains-1.0.0/test_out2/target.query.allfilled.chain.gz not found!
    The pipeline crashed. Please contact developers by creating an issue at: https://github.com/hillerlab/make_lastz_chains
Hello Dr. Hiller and @kirilenkobm
First, I wanted to inform you both that make_lastz_chains worked for me. As @kirilenkobm suggested, I first ran it using a single chromosome, and it worked. When I then ran it for the whole genome, I got chain alignments for my genome data.
Second, I wanted to update you on my comparison between the orthology inference of the HAL and make_lastz_chains alignments for TOGA:
    Sp A (HAL chains):         orthology_mapping: Extracted 21518 orthology components in total  orthology_mapping: Orthology class sizes:
    Sp A (make_lastz_chains):  orthology_mapping: Extracted 20671 orthology components in total  orthology_mapping: Orthology class sizes:
When I compared gene loss between the two tools, make_lastz_chains gave me a comparatively lower number. For example, for this same species I got 4289 lost genes using the HAL alignment and 3788 using make_lastz_chains.
I am going to use make_lastz_chains for my data going forward.
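The disagreement between the two loss sets can be pinned down directly with a quick sketch (gene IDs here are hypothetical placeholders for the two lost-gene lists):

```python
# Compare gene-loss calls made with two different alignments.
def compare_losses(hal_lost, lastz_lost):
    hal_lost, lastz_lost = set(hal_lost), set(lastz_lost)
    return {
        "both": hal_lost & lastz_lost,        # lost with either alignment
        "hal_only": hal_lost - lastz_lost,    # calls unique to hal-chains
        "lastz_only": lastz_lost - hal_lost,  # calls unique to lastz-chains
    }

diff = compare_losses(["geneA", "geneB", "geneC"], ["geneB", "geneC", "geneD"])
print(sorted(diff["hal_only"]))    # ['geneA']
print(sorted(diff["lastz_only"]))  # ['geneD']
```

The genes in the two "only" sets are the ones worth inspecting in a browser.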
Thank you Vinita
Hi Vinita, thanks for posting these results. Overall I would say that hal-chains also seem to work, especially because a hal alignment can be used for several different reference species.
What I find interesting is that hal-chains give more 1:1 and more 1:0 classifications, which is unexpected. If you have the chance to visualize both types of chains (UCSC bigChain), what do the 1:1 or 1:0 genes that differ between hal- and lastz-chains look like? And if you have a decent annotation of your query genome, how many of these differential TOGA annotations are supported by that annotation?
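The annotation-support check can start as a simple interval-overlap count, e.g. (a sketch with hypothetical coordinates; real input would be parsed from the TOGA and query annotation BED files):

```python
# Naive interval overlap between TOGA query loci and an existing annotation.
# Each locus is a (chrom, start, end) tuple, half-open BED-style coordinates.
def overlaps(a, b):
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def supported(toga_loci, annotation):
    """Return the TOGA loci that overlap at least one annotated interval."""
    return [locus for locus in toga_loci
            if any(overlaps(locus, ann) for ann in annotation)]

toga = [("chr1", 100, 500), ("chr2", 10, 50)]
anno = [("chr1", 400, 900)]
print(supported(toga, anno))  # [('chr1', 100, 500)]
```

For genome-scale data, the same question is usually answered with bedtools intersect rather than a hand-rolled loop.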
We have now also succeeded in running Cactus and will make a few tests for mammals in the next weeks.
Thx
Thank you Dr. Hiller for pointing out orthology differences.
Yes, I do have annotations for my 2 query genomes, and I tried to map the `query_annotation.bed`, which was the output from TOGA. But that gave me an error that the annotations do not match. I realised that the gene IDs in the query_annotation.bed file are still from the reference genome.
I am sorry, I am just at the very basics of bioinformatics, and I don't know how to map these annotations to the query genomes or manipulate such complex data at the whole-genome level. Right now I am so overwhelmed with the gene-loss data that I haven't tried this at the whole-genome scale yet. I have follow-up queries based on this:
One of them concerns `codon.fasta` and `prot.fasta`. For this I also referenced one of your papers, "Six reference-quality genomes of bats", where you showed the mutations on exon blocks and then put the alignment below them. Apologies for the many queries, but trust me, your suggestions, hints, or links to any discussion will help me take the right path to analyse this data further.
Best regards Vinita
Hi Vinita, sorry for the slow response. TOGA keeps the gene and transcript IDs of the reference in query_annotation.bed, as it is homology-based. This tells you which gene TOGA thinks is present at the query locus.
To the questions:
1) We will play around with the cactus alignments we have now and will check for differences. More when we have results.
2) For each query, you need a complete TOGA run, which gives a vs_$query directory that has the loss_summ and other files. This is what the script needs. The 2bit is just needed for the actual TOGA run.
3) The information on inactivating mutations is contained in another file called inact_mut_data.txt.
4) With scaffold level, do you mean a genome with more assembly gaps? If so, yes, TOGA will find more genes with missing pieces in such an assembly. TOGA can work with less complete or fragmented assemblies, as our transcript classification takes this into account. Blast would just tell you that an exon or gene is not found, but it does not tell you whether this is because there is an assembly gap (missing) or a real exon deletion.
5) This example describes a gene that was likely lost in the ancestral branch leading to the clade with the 14 ingroups. We typically just screen for genes that are lost in particular species (like the hummingbirds) but present in most or all others.
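For point 2), gathering the per-species gene classifications from the vs_$query directories could be sketched like this (assumptions: directory names start with "vs_", and loss_summ_data.tsv has three tab-separated columns of level, ID, and status):

```python
import os

# Build a gene -> {species: status} table from a set of TOGA output
# directories (vs_sp1, vs_sp2, ...), keeping only GENE-level rows.
# The three-column loss_summ_data.tsv layout is an assumption here.
def collect_gene_status(toga_root):
    table = {}
    for d in sorted(os.listdir(toga_root)):
        if not d.startswith("vs_"):
            continue
        species = d[3:]
        path = os.path.join(toga_root, d, "loss_summ_data.tsv")
        with open(path) as fh:
            for line in fh:
                level, gene, status = line.rstrip("\n").split("\t")
                if level == "GENE":
                    table.setdefault(gene, {})[species] = status
    return table
```

A table like this also makes the outgroup screen in 5) a simple filter over rows.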
Thank you very much, Dr. Hiller, for answering all my queries. Looking forward to seeing whether there are significant differences in orthology between Cactus and make_lastz_chains.
Hi Dr. @MichaelHiller
I have a couple of queries regarding my TOGA outputs. I am working with vertebrate genomes (sizes ranging from 550 Mb to 1 Gb; all genomes I considered are of very good quality, BUSCO > 90%). I wanted to explore gene loss among my selected group of species (PS: phylogenetic figure of my data attached). Currently I consider only one species as reference, which I marked as Ref in the tree. I wanted to see how many genes, and which genes, are lost in all my query genomes.
For this I used a HAL alignment, because your pipeline make_lastz_chains was NOT working with the test data (https://github.com/hillerlab/make_lastz_chains/issues/46 and https://github.com/hillerlab/make_lastz_chains/issues/45). I also raised an issue there, but I guess because of your busy schedule it has not yet been resolved.
To convert the HAL into chains I used the following commands: `hal2fasta` and then `faToTwoBit` to get the .2bit files; `halStats` to get the .bed sequences for all query genomes; `halLiftover` to get the .psl between ref and target; and `axtChain` to get the .chain alignment between each genome pair. This hal-to-chain process took 3-4 hrs for most of the genomes, but for some it took 6-7 hrs.
Further, I used these .chain files to run TOGA. Because my reference genome is very well annotated and I supplied the isoform file, I got the list of genes that are lost (and the other categories as well) in `loss_summ_data.tsv`. I verified against some already published genomes that certain genes are lost in my focal/query species, and I also got these genes as 'lost' in my loss_summ file.
Some questions regarding my outputs:
Out of 1000 genes, 453 are losses and the rest are "novel genes" in the Ref species. Out of these 453, lots of genes have identical gene descriptions but still have different gene IDs.
I would really appreciate any suggestions from your end.
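Genes that share a description but carry different IDs can be grouped with a short sketch (the IDs and descriptions below are hypothetical placeholders for the lost-gene table):

```python
from collections import defaultdict

# Group lost-gene IDs by description to spot likely paralogues or
# duplicated annotations that were counted as separate losses.
def group_by_description(genes):
    groups = defaultdict(list)
    for gene_id, description in genes:
        groups[description].append(gene_id)
    # keep only descriptions shared by more than one gene ID
    return {desc: ids for desc, ids in groups.items() if len(ids) > 1}

lost = [("ENSG01", "olfactory receptor"),
        ("ENSG02", "olfactory receptor"),
        ("ENSG03", "keratin")]
print(group_by_description(lost))  # {'olfactory receptor': ['ENSG01', 'ENSG02']}
```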
Thank you very much