jcu23686 / BINF8940

GENE(BINF)8940E class repository

Project_Week1 #6

Open jcu23686 opened 1 year ago

jcu23686 commented 1 year ago

For my project I am planning to use the article "Semi-automated assembly of high-quality diploid human reference genomes". The article discusses various techniques for assembling the human genome and compares their results. 23 different assembly combinations were used, each with a different pipeline, so each produces different scaffolds and contigs for the human genome. I will look into a few of the possible pipelines myself and see which method gives the largest scaffolds.

A question I currently have: how much time would these sorts of genome assembly methods take on the cluster? The article states that some of the pipelines required far more cores, time, and memory (GB) than anything we have done in class.

Article link: https://www.nature.com/articles/s41586-022-05325-5

cbergman commented 1 year ago

Hi Jack. Based on my notes, we discussed that you would be characterizing the assemblies in Jarvis et al. using QUAST, but that you would not be assembling these genomes from raw data yourself. To get started you will need to download genomes from the following website: https://data.nist.gov/od/id/mds2-2578. You'll need to click the arrowhead next to assemblies-and-benchmarking_results to see all of the possible assemblies you can download.

Then you will need to unpack these assembly archives and run quast on the assembly, following something like:

wget https://data.nist.gov/od/ds/ark:/88434/mds2-2578/assemblies-and-benchmarking_results/asm1.tar.gz
tar -xvzf asm1.tar.gz
quast asm1/assembly/Ash1v1.7.fa.gz

For some assemblies (e.g. asm2) there will be two haplotypes in the archive (asm2a and asm2b). For others (e.g. asm3) there will be more than one archive (asm3a and asm3bc). I would treat each haplotype or version of an assembly (a, b, c, etc.) as a separate file for analysis. In the case of asm2 and asm3 there would be a total of 5 files to analyze:

asm2ab/assembly/asm2a:
total 1605728
-rwx------@ 1 cbergman  MYID\Domain Users  816529939 Mar 14  2022 Dovetail_HG002_phase1_scaffolds.fa.gz

asm2ab/assembly/asm2b:
total 1622072
-rwx------@ 1 cbergman  MYID\Domain Users  816690023 Mar 14  2022 Dovetail_HG002_phase2_scaffolds.fa.gz

asm3a/assembly/asm3a:
total 1589280
-rwx------@ 1 cbergman  MYID\Domain Users  811399219 Mar 14  2022 asm.fa.gz

asm3bc/assembly/asm3b:
total 1590792
-rwx------@ 1 cbergman  MYID\Domain Users  809426420 Mar 14  2022 pri_asm.fa.gz

asm3bc/assembly/asm3c:
total 1507408
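
Treating each haplotype or version as a separate file could be scripted roughly as below. This is only a sketch, assuming the archives have already been unpacked into the current directory and that every assembly is a `*.fa.gz` file somewhere under an `assembly/` folder, as in the listing above. The function names and the `QUAST_BIN`, `OUT_ROOT`, and `THREADS` variables are hypothetical names made up for this example. The `-o` (output directory) and `-t` (threads) options are standard QUAST options.

```shell
#!/usr/bin/env bash
# Sketch: run QUAST once per assembly FASTA found under */assembly/.
# QUAST_BIN, OUT_ROOT and THREADS are hypothetical knobs for this example.

# Build a label such as "asm2a_Dovetail_HG002_phase1_scaffolds" from a path
# like "asm2ab/assembly/asm2a/Dovetail_HG002_phase1_scaffolds.fa.gz".
assembly_label() {
    local path="$1"
    local dir base
    dir="$(basename "$(dirname "$path")")"
    # Files sitting directly in assembly/ (e.g. asm1) take the archive name.
    if [ "$dir" = "assembly" ]; then
        dir="$(basename "$(dirname "$(dirname "$path")")")"
    fi
    base="$(basename "$path")"
    base="${base%.gz}"
    base="${base%.fa}"
    printf '%s_%s\n' "$dir" "$base"
}

# Run QUAST on every *.fa.gz below <root>, one output directory per file.
run_quast_on_all() {
    local root="${1:-.}"
    local out_root="${OUT_ROOT:-quast_results}"
    local quast_bin="${QUAST_BIN:-quast}"
    local threads="${THREADS:-4}"
    local fasta label
    find "$root" -path '*/assembly/*' -name '*.fa.gz' | sort |
    while IFS= read -r fasta; do
        label="$(assembly_label "$fasta")"
        "$quast_bin" -o "$out_root/$label" -t "$threads" "$fasta"
    done
}
```

Calling `run_quast_on_all .` from the directory holding the unpacked archives would then give one QUAST report per file, which keeps the haplotypes separate when comparing scaffold sizes. For human-sized genomes it may also be worth checking whether QUAST's `--large` option is appropriate.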

If you are still having trouble getting started, please come to class on Tuesday to discuss more.