marbl / CHM13

The complete sequence of a human genome

CHM13 #14

Closed duhuipeng closed 3 years ago

duhuipeng commented 3 years ago

Dear author, I'd like to ask: when you assembled the centromere, which data pool was used? I want to reproduce this article.

skoren commented 3 years ago

Which centromere are you talking about? There are multiple released assemblies. For the X, we relied on rel2 data, which did not include sequencing from the University of Washington or UC Davis (it was only the first 144 partitions), to make the assembly. Some rel3 data was used for validating the structure of the centromere.

duhuipeng commented 3 years ago

Dear author, I calculated that the X centromere is about 120 Mb. I read in the article that 12 reads were used for the centromere assembly, but I found only five of those read IDs in this centromere region. I thought I was looking in the wrong region, but the centromere region provided in the centroFlye article doesn't contain these read IDs either. Finally, I searched the rel2 raw data and could not find these read IDs there. What is the cause of this, and how can I get these reads?

[image] Here are the read IDs provided in the article.

[image] This is what I searched for in the raw data and didn't find. The other six read IDs are the same.

skoren commented 3 years ago

The centromere is definitely not 120 Mb; it's on the order of 3 Mb. As that figure legend says, the tiling path is for illustrative purposes and was not how the original centromere was assembled ("(d) A minimum tiling path was reconstructed for illustration purposes (as shown in Fig 2a) and was not the mechanism for initial assembly."). For that reason, the reads aren't restricted to rel2; as I said above, rel3 was used to validate the structure of the assembly. Thus, about half are only present in rel3 (the read you list is one of those).

Of the twelve, the following 5 are in rel2:

64d464d1-f317-4dff-a259-de6097a5cd4c
1ccd919f-5726-4d79-8cfe-fe2b344070a1
3d0fa869-028f-45be-be41-b2487897bb25
e39308c6-0c73-45d5-9b8d-7f764af858be
063fca09-81fc-4c2d-81ad-16fb2bfee76f
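
For anyone repeating this check, here is a minimal sketch of how one might test which read IDs are present in a release. It is not how the lookup above was done. It assumes the reads are in standard 4-line FASTQ and that the read ID is the first whitespace-delimited token after the leading `@`; the function names `find_read_ids` and `find_in_file` are hypothetical.

```python
import gzip
from typing import Iterable, Set


def find_read_ids(fastq_lines: Iterable[str], wanted: Set[str]) -> Set[str]:
    """Return the subset of `wanted` IDs that appear as FASTQ record headers.

    Assumes strict 4-line FASTQ records, so only every fourth line
    (starting with the first) is treated as a header. This avoids
    mistaking a quality line that begins with '@' for a header.
    """
    found = set()
    for i, line in enumerate(fastq_lines):
        if i % 4 != 0:
            continue
        # Drop the leading '@' and keep the first token as the read ID.
        fields = line[1:].split()
        if fields and fields[0] in wanted:
            found.add(fields[0])
    return found


def find_in_file(path: str, wanted: Set[str]) -> Set[str]:
    """Scan a plain or gzip-compressed FASTQ file at `path` (hypothetical path)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return find_read_ids(handle, wanted)
```

Running `find_in_file` over each partition of a release and taking the union of the results shows which of the twelve IDs that release contains. IDs missing from every rel2 partition would then be looked up in rel3.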

duhuipeng commented 3 years ago

I'd like to correct something first, as I may have caused a misunderstanding: the reads I recruited to the X chromosome centromere total about 120 Mb. Yes, I found the five read IDs you listed in rel2. I'm curious: shouldn't the assembly be done first and then verified? Doesn't that mean the assembled reads should all come from rel2? Or did you use other reads from rel3, after verification, to replace some of the rel2 sequences (the assembled reads)?

duhuipeng commented 3 years ago

Dear author, can it not be assembled with rel2 data alone? I have another question and hope to get your advice. I found the 2 kb HOR; although several bases are inconsistent with centroFlye, I think that may be due to sequencing errors. I then used HMMER to determine the SV locations. The results are as follows: [image] [image]

What I see in the article is that reads 063 and 3d0 extend and overlap each other. I don't know how to see the overlap between the two sequences in the HMMER output. Can you indicate which column shows the SVs shared between them?

Could you tell me how to identify SVs with HMMER? From the HMMER output files, how do I determine which type of SV is present, and how do I identify and extract it? Previously I used HMMER to extract sequences to build a database, but I don't know how to proceed this time. Could you give me some advice?

skoren commented 3 years ago

The centromere can be assembled with rel2 data, though there is one region lacking SVs for which we had to rely on centroFlye's resolution. However, as I already said, the figure you're pulling read IDs from in the paper is an illustration and not how the centromere was assembled, so it includes validation data from rel3. The original assembly used more reads, which were used to form contigs with a more accurate consensus, as in part b of extended figure 5.

HMMER doesn't find overlaps, so you can't use it in that way. I'm pretty sure that screenshot isn't nearly the full read, so the SVs don't have to be listed there. You need to find non-canonical versions of the repeat within each read by looking at the HMMER matches that don't look like the canonical repeat unit. From the paper: "Reads containing alpha in the reverse orientation were reverse complemented, and screened with HMMER (v3) using a 2057 bp DXZ1 repeat unit. We then employed run-length encoding in which runs of the 2057 bp canonical repeat (defined as any repeat in the range of min: 1957 bp, max: 2157 bp) were stored as a single data value and count, rather than the original run."
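
The run-length encoding step in that quote can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the paper: it assumes you have already turned the HMMER matches for one read into an ordered list of repeat-unit lengths in bp. That conversion step, the function name `run_length_encode`, and the `"canonical"`/`"variant"` labels are all hypothetical. Only the 1957 to 2157 bp window comes from the quoted text.

```python
from typing import List, Tuple

# Window for a canonical 2057 bp DXZ1 repeat unit, as quoted from the paper.
CANONICAL_MIN = 1957
CANONICAL_MAX = 2157


def run_length_encode(unit_lengths: List[int]) -> List[Tuple[str, int]]:
    """Collapse runs of canonical repeat units along one read.

    Each consecutive run of canonical units becomes ("canonical", count).
    Each unit outside the canonical window is kept individually as
    ("variant", length_in_bp), since those are the candidate SVs.
    """
    encoded: List[Tuple[str, int]] = []
    for length in unit_lengths:
        if CANONICAL_MIN <= length <= CANONICAL_MAX:
            if encoded and encoded[-1][0] == "canonical":
                # Extend the current canonical run by one unit.
                encoded[-1] = ("canonical", encoded[-1][1] + 1)
            else:
                encoded.append(("canonical", 1))
        else:
            encoded.append(("variant", length))
    return encoded


if __name__ == "__main__":
    # Made-up unit lengths for a single read, for illustration only.
    example = [2057, 2060, 1500, 2057, 2057, 2057, 4100]
    print(run_length_encode(example))
```

With the made-up input above, the two leading canonical units collapse into one run, the 1500 bp unit stays as a variant, the next three canonical units collapse, and the 4100 bp unit stays as a variant. The variant entries and the canonical counts between them are what you would then compare across reads.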

duhuipeng commented 3 years ago

Thank you for your reply. I am downloading the rel3 data to see if I can reproduce the results.