sanger-tol / treeval

Pipelines for the production of Treeval data
https://pipelines.tol.sanger.ac.uk/treeval

Some uncertain questions #312

Closed gitforp closed 1 month ago

gitforp commented 1 month ago

Hello, I have some questions regarding treeval that I don't quite understand.

I recently assembled a genome, but the Hi-C results weren't very ideal. My advisor sent me a paper about gEVAL (https://doi.org/10.1093/gigascience/giaa153) and said this pipeline could improve my assembly quality. Following the links in the paper, I found treeval.

However, after running it, I didn't see any new corrected sequences in the output, and the example files on https://pipelines.tol.sanger.ac.uk/treeval/1.1.1/output don't seem to indicate that new sequence files would be generated either. But my advisor mentioned gEVAL generates new sequence fasta files.

So I'm curious if treeval and gEVAL are the same pipeline, and whether they generate corrected fasta files.

Additionally, I noticed the pipeline of genomeassembly. Is there any relationship between genomeassembly, treeval, and gEVAL?

These are my questions, and I look forward to your reply.

Best wishes!

DLBPointon commented 1 month ago

Hi @gitforp.

TreeVal (Tree of Life Evaluation browser) is the successor to gEVAL (genome EVALuation browser), which is now deprecated and has been entirely replaced by TreeVal. Both generate the evidence required for improving a genome assembly through manual curation. At no point does either manipulate the original sequence. The output FASTA file from TreeVal is a renamed version of the input sequence; this is to simplify our process. In essence, they are the same pipeline, just written in very different ways. Some of gEVAL was so old that we had no one on hand to release new fixes (the web browser in particular; the pipeline itself was fantastic, we just needed something newer to ensure maintainability).
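One way to check the "renamed, not modified" point on your own run is to compare the sequences in the input and output FASTA files while ignoring the record names. The following is only a minimal sketch in plain Python, not part of TreeVal: the function names are made up for illustration, and it assumes that record order and soft-masking (lower-case bases) should be ignored, which may not match what you want to test.

```python
import hashlib
from collections import Counter
from typing import Iterable, Iterator


def fasta_sequences(lines: Iterable[str]) -> Iterator[str]:
    """Yield each record's sequence with line breaks removed, ignoring header text."""
    chunks = []
    in_record = False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if in_record:
                yield "".join(chunks).upper()
            chunks = []
            in_record = True
        elif in_record:
            chunks.append(line)
    if in_record:
        yield "".join(chunks).upper()


def sequence_digests(lines: Iterable[str]) -> Counter:
    """Count SHA-256 digests of sequences so records can be compared without names."""
    return Counter(
        hashlib.sha256(seq.encode()).hexdigest() for seq in fasta_sequences(lines)
    )


def same_sequences(lines_a: Iterable[str], lines_b: Iterable[str]) -> bool:
    """True if both inputs hold the same multiset of sequences (assumption: case-insensitive)."""
    return sequence_digests(lines_a) == sequence_digests(lines_b)
```

It can be called on two open file handles, for example `same_sequences(open("input.fa"), open("treeval_output.fa"))`, where both paths are hypothetical placeholders for your own files. A result of `True` would be consistent with only the names having changed.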

TreeVal sits between two major processes that do modify the sequence, which TreeVal itself does not. The first is sanger-tol/ASCC, a contamination and cobiont screening pipeline (not yet for external users) that removes contaminant sequence. The second is manual curation, the process of manually correcting the sequence using the evidence produced by TreeVal (and previously gEVAL) with tools such as HiGlass and PretextView (and associated scripts, some of which may not be shown in the YouTube video below because they came into use after filming).

The most up-to-date version of these processes will be showcased at BGA24 in October (1st to 25th), and the BGA23 manual curation session can be seen here: https://github.com/thebgacademy/manual-curation https://www.youtube.com/watch?v=F5jvh3owzl4

The sanger-tol/genomeassembly pipeline will be the replacement for our current assembly pipeline, which is written in another pipeline language.

genomeassembly will generate the assembly from contigs (from sequencing data), along with a number of graphs as evidence of assembly quality. After visual checks, the assembly will be fed to ASCC for contamination screening. TreeVal will then generate the data used as evidence for manual manipulation of the assembly, which results in a more contiguous assembly.

Everything in the sanger-tol organisation is part of a modernisation process for the Tree of Life programme at Sanger.

yumisims commented 1 month ago

Just to be specific, TreeVal provides analysis metrics visualized using JBrowse2 and the high-resolution Hi-C browser HiGlass, while the gEVAL data processing pipeline produces metrics to display on the Ensembl browser, which is database-driven. The primary reason we wanted to refurbish the data analysis pipeline is that the Ensembl browser is hard to update, and we want the analysis visualization to be flat-file-based. TreeVal only produces evidence that aids manual curation and does not change the input assembly FASTA file. If you have problematic Hi-C data, please consider using the genomeassembly pipeline, which includes YaHS to rescaffold the assembly using Hi-C data. The genomeassembly pipeline outputs a FASTA file that should then go through TreeVal, and you should see some improvement in the Hi-C map that TreeVal returns. If you are inclined to proceed with manual curation, please feel free to join the Assembly Curation Slack (assemblycuration.slack.com), where you can find help to further improve your assembly.

gitforp commented 1 month ago

Thank you very much for your explanations, teachers. I don't have any more questions for now. Best wishes for your work!