Open jcu23686 opened 1 year ago
Hi Jack. Based on my notes, we agreed that you would characterize the assemblies in Jarvis et al. using QUAST rather than assembling these genomes from raw data yourself. To get started, download the genomes from the following website: https://data.nist.gov/od/id/mds2-2578. Click the arrowhead next to assemblies-and-benchmarking_results to see all of the assemblies available for download.
Then unpack each assembly archive and run QUAST on the assembly, with something like:
wget https://data.nist.gov/od/ds/ark:/88434/mds2-2578/assemblies-and-benchmarking_results/asm1.tar.gz
tar -xvzf asm1.tar.gz
quast asm1/assembly/Ash1v1.7.fa.gz
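If you run QUAST on several assemblies from the same working directory, it helps to give each run its own output directory. Below is a minimal sketch of one way to do that. The helper names (quast_outdir, run_quast), the quast_results/ prefix, and the default of 4 threads are hypothetical choices for illustration, not something from the paper or the NIST site. The -o, -t, and --large options are standard QUAST options; whether --large suits your runs is an assumption you should check against the QUAST manual.

```shell
#!/bin/sh
# Sketch only: helper names and the quast_results/ prefix are hypothetical.

# Turn a FASTA path into a unique report directory name, e.g.
#   asm1/assembly/Ash1v1.7.fa.gz -> quast_results/asm1_assembly_Ash1v1.7
quast_outdir() {
    # Strip a trailing .gz, then a trailing .fa, then replace "/" with "_".
    printf 'quast_results/%s\n' \
        "$(printf '%s' "$1" | sed -e 's/\.gz$//' -e 's/\.fa$//' -e 's|/|_|g')"
}

# Run QUAST on one assembly file.
#   $1 = path to the FASTA file
#   $2 = number of threads (optional, defaults to 4 here as an assumption)
run_quast() {
    # --large asks QUAST to use settings intended for large genomes.
    quast --large -t "${2:-4}" -o "$(quast_outdir "$1")" "$1"
}
```

After sourcing the file, something like `run_quast asm1/assembly/Ash1v1.7.fa.gz 8` would write the report under quast_results/asm1_assembly_Ash1v1.7.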
For some assemblies (e.g. asm2) there are two haplotypes in one archive (asm2a and asm2b). For others (e.g. asm3) there is more than one archive (asm3a and asm3bc). I would treat each haplotype or version of an assembly (a, b, c, etc.) as a separate file for analysis. For asm2 and asm3 that gives a total of 5 files to analyze:
asm2ab/assembly/asm2a:
total 1605728
-rwx------@ 1 cbergman MYID\Domain Users 816529939 Mar 14 2022 Dovetail_HG002_phase1_scaffolds.fa.gz
asm2ab/assembly/asm2b:
total 1622072
-rwx------@ 1 cbergman MYID\Domain Users 816690023 Mar 14 2022 Dovetail_HG002_phase2_scaffolds.fa.gz
asm3a/assembly/asm3a:
total 1589280
-rwx------@ 1 cbergman MYID\Domain Users 811399219 Mar 14 2022 asm.fa.gz
asm3bc/assembly/asm3b:
total 1590792
-rwx------@ 1 cbergman MYID\Domain Users 809426420 Mar 14 2022 pri_asm.fa.gz
asm3bc/assembly/asm3c:
total 1507408
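Treating each file separately can be scripted rather than typed by hand. The following is a rough sketch, assuming every assembly file ends in .fa.gz and sits in a directory named after its haplotype or version, as in the listing above. The function names (list_assemblies, run_all), the quast_results/ prefix, and the QUAST environment variable used to override the command name are all hypothetical.

```shell
#!/bin/sh
# Sketch only: function names, quast_results/, and $QUAST are hypothetical.

# Print every *.fa.gz file found under the given directories, one per line.
list_assemblies() {
    find "$@" -type f -name '*.fa.gz' | sort
}

# Run QUAST once per assembly file, writing each report to a directory
# named after the file's parent directory (e.g. asm2a, asm2b, asm3a).
# Note: this assumes file paths do not contain newlines.
run_all() {
    list_assemblies "$@" | while IFS= read -r fasta; do
        label="$(basename "$(dirname "$fasta")")"
        # </dev/null keeps the command from consuming the file list on stdin.
        "${QUAST:-quast}" -o "quast_results/$label" "$fasta" </dev/null
    done
}
```

For example, `run_all asm2ab asm3a asm3bc` would visit each assembly file under those directories. Setting `QUAST=echo` first prints the commands instead of running them, which is a cheap way to check the file list before starting long jobs.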
If you are still having trouble getting started, please come to class on Tuesday to discuss more.
For my project I am planning to use the article "Semi-automated assembly of high-quality diploid human reference genomes". The article compares various techniques for assembling the human genome and reports their results. In total, 23 different assembly combinations were used to assemble the human genome. Each combination uses a different pipeline and produces different scaffolds and contigs. I will look into a few of the pipelines myself and see which method gives the largest scaffolds.
A question I currently have: how much time would these kinds of genome assembly methods take on the cluster? The article states that some of the pipelines required far more cores, time, and GB than anything we have done in class.
Article link: https://www.nature.com/articles/s41586-022-05325-5