marbl / CHM13

The complete sequence of a human genome

about naming difference in the same database file #97

Closed zincuum closed 3 weeks ago

zincuum commented 1 month ago

Dear T2T-CHM13 team,

Thank you for making this amazing set of resources available. I have a question about the naming convention for the files you provide. The download section of the T2T GitHub page lists files related to VCF calls. One of them, ClinVar20220313, which was lifted over from the GRCh38 file, is linked on the main page, and downloading it gives chm13v2.0_ClinVar20220313.vcf. On the FTP there is another T2T-CHM13 ClinVar file, Homo_sapiens-GCA_009914755.4-2022_10-clinvar.vcf. I would like to know the difference between these two files.

  1. First, when is the assembly named chm13 and when is it named GCA_009914755.4?
  2. How do these two files differ in content?
arangrhie commented 1 month ago

Hi @zincuum,

Sorry, I am confused - which FTP are you referring to? GCA_009914755.4 is the GenBank accession for the submitted assembly.

zincuum commented 1 month ago

> Hi @zincuum,
>
> Sorry, I am confused - which FTP are you referring to? GCA_009914755.4 is the GenBank accession for the submitted assembly.

I am referring to the FTP directory highlighted in the attached screenshot. The link is: https://ftp.ensembl.org/pub/rapid-release/species/Homo_sapiens/GCA_009914755.4/ensembl/variation/2022_10/vcf/

[Image: FTP_snpshot]
arangrhie commented 1 month ago

I believe the data in that Ensembl FTP were created from the GRCh38-based HAL file, downloaded from the Minigraph-Cactus alignment of Liao et al. 2022. The rest of the 'lifted' data listed on this GitHub were generated using a curated chain file of alignments between GRCh38 and CHM13.

zincuum commented 1 month ago

Thanks for your reply. Could you please confirm whether I understood correctly? You mean that both files were created by lifting over from GRCh38, but the difference is that the Ensembl FTP file used a GRCh38 file created by the Minigraph-Cactus alignment as input, while the one on GitHub used a GRCh38 file curated by NCBI?

arangrhie commented 3 weeks ago

Yes, I confirmed with the Ensembl team. The ClinVar data were lifted over from GRCh38 using the Minigraph-Cactus alignment, as was done for the gnomAD data. The 1KGP and SGDP datasets on the FTP are identical to those from the Y paper. See the Supplementary Notes for more details. Sorry for the confusion.
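
For anyone who wants to check how the two liftovers differ in content, here is a minimal Python sketch that compares two VCF files by variant ID. It is not something the T2T or Ensembl teams provided. It rests on two assumptions that should be verified against the actual files: that the ID column holds the same kind of identifier in both files, and that contig names differ at most by a leading `chr`. The function names are hypothetical.

```python
import gzip


def normalize_chrom(name):
    """Strip a leading 'chr' so 'chr1' and '1' compare equal (assumed naming difference)."""
    return name[3:] if name.lower().startswith("chr") else name


def open_vcf(path):
    """Open a plain-text or gzip-compressed VCF for reading as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def read_variants(handle):
    """Map variant ID -> (chrom, pos, ref, alt) from an open VCF text handle."""
    variants = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
        if vid == ".":
            continue  # records without an ID cannot be matched by ID
        variants[vid] = (normalize_chrom(chrom), int(pos), ref, alt)
    return variants


def compare(a, b):
    """Count IDs unique to each file and shared IDs with equal or different placement."""
    shared = a.keys() & b.keys()
    same = sum(1 for vid in shared if a[vid] == b[vid])
    return {
        "only_a": len(a.keys() - b.keys()),
        "only_b": len(b.keys() - a.keys()),
        "same_placement": same,
        "different_placement": len(shared) - same,
    }
```

To use it, open local copies of the two files with `open_vcf`, pass each handle to `read_variants`, and pass the two resulting dictionaries to `compare`. A large `different_placement` count would suggest the two liftover methods placed the same records at different coordinates. Large `only_a` or `only_b` counts would suggest records that only one method could lift, or that the ID assumption does not hold.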