Open AhmedArslan opened 1 year ago
Hey, I believe one of the main (if not only?) uses of these packages is for MungeSumstats, hence why I'm answering this. The creation of supplementary dbSNP release packages is something that has been discussed here.
The TLDR is that it is very RAM intensive and time consuming to create these packages (on the scale of 80 CPUs and 384 GB RAM running for a week for each package) and so isn't really feasible using the current approach. Really we need to refactor how this is done, which isn't something @hpages or I have had time to do.
At least dbSNP154 is slightly smaller than dbSNP155 (729,491,867 RS count vs 1,085,850,277) so the requirements won't be so bad.
Let me know if you want to give this a try @AhmedArslan, by following the overview of the process I provided here. I'll be happy to answer questions and provide more detailed guidance if needed.
Best, H.
@Al-Murphy Actually now that I look at the numbers, I see that the size of dbSNP156 is 1,130,597,309 RS count, which is really not that much bigger than dbSNP155 (only 4% bigger), especially compared to the growth between dbSNP154 and dbSNP155, which was 49%. So maybe I'll give a shot at forging SNPlocs.Hsapiens.dbSNP156.GRCh38 and SNPlocs.Hsapiens.dbSNP156.GRCh37 after all, in the next couple of weeks or so.
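For anyone who wants to double-check the growth figures, here is a minimal sketch that recomputes them from the RS counts quoted in this thread. The dictionary and the `growth_percent` helper are illustrative names only; they are not part of any SNPlocs tooling.

```python
# RS counts as quoted in this thread for each dbSNP build.
rs_counts = {
    "dbSNP154": 729_491_867,
    "dbSNP155": 1_085_850_277,
    "dbSNP156": 1_130_597_309,
}


def growth_percent(old: int, new: int) -> float:
    """Relative increase from `old` to `new`, expressed as a percentage."""
    return (new - old) / old * 100


growth_154_to_155 = growth_percent(rs_counts["dbSNP154"], rs_counts["dbSNP155"])
growth_155_to_156 = growth_percent(rs_counts["dbSNP155"], rs_counts["dbSNP156"])

print(f"dbSNP154 -> dbSNP155: {growth_154_to_155:.1f}%")
print(f"dbSNP155 -> dbSNP156: {growth_155_to_156:.1f}%")
```

Rounded to whole percentages this gives the 49% and 4% mentioned above.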
@hpages The only limitation is that I do not have the resources to perform such an intensive analysis. Although if dbSNP155 is broadly different from dbSNP154 (as you mentioned) in terms of SNP ids, perhaps it's essential to produce dbSNP154?
> Although if dbSNP155 is broadly different from dbSNP154 (as you mentioned) in terms of SNP ids
Well, all I'm saying is that dbSNP155 has a lot more SNP ids than dbSNP154. That doesn't mean that the SNP ids in the latter are not in the former.
IIUC dbSNP builds are incremental with every new build mostly adding new SNPs to the previous one and making some corrections to the existing ones. So I would expect dbSNP155 to be a superset of dbSNP154 i.e. that most of the SNP ids found in the latter are still in the former. In other words, I would imagine that using dbSNP155 would still cover your use case.
In the unlikely case that the SNPs in dbSNP154 have changed so much in dbSNP155 that the latter cannot be used to annotate the SNPs in the former, then this would suggest that the data in dbSNP154 is outdated, and that the GWAS Catalogue should probably be updated to be based on dbSNP155 in order to remain relevant.
What's the plan anyways for the GWAS Catalogue? How often do they switch to a more recent dbSNP build? dbSNP 154 is more than 3 years old now so maybe it's time.
So earlier today I asked the GWAS folks about their plans to map to a more recent dbSNP build and I got the following answer:
Hi Hervé,
Thanks for your interest in the GWAS Catalog. We use dbSNP mappings from Ensembl, which is currently on Build 154. However, we expect that with the next release scheduled for this month, the mapping will be updated to dbSNP 156. See Ensembl’s page here: https://www.ensembl.info/2023/09/13/whats-coming-in-ensembl-111-ensembl-genomes-58/
I understand that build 155 will be skipped.
Best wishes,
Elliot Sollis
GWAS Catalog Curator
> On 6 Nov 2023, at 18:44, Hervé Pagès via gwas-info <gwas-info@ebi.ac.uk> wrote:
>
> Hi,
>
> Are there any plans to update the GWAS catalogue to map it to dbSNP Build 155 or 156 instead of dbSNP Build 154?
>
> Is there a timeline for that?
>
> Thanks,
>
> H.
> --
> Hervé Pagès
>
> Bioconductor Core Team
> [hpages.on.github@gmail.com](mailto:hpages.on.github@gmail.com)
One more reason to focus on dbSNP156!
I will start working on SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37] this week.
Hi @hpages, has there been any update on the work on SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37]? Ideally I would love to add them to MungeSumstats when available!
Thanks for the ping.
The bad news is that we had many technical problems with the powerful server that I use for these things. The server is back but it's not the first time that the IT people manage to bring it back. However they've always done it without really addressing the root causes so almost zero progress has been made to improve reliability.
Anyways I'm trying to run my scripts there again but my expectations are low. Fingers crossed.
The other bad news is that it turns out that these huge SNPlocs packages have contributed significantly to our egress costs in a way that is not sustainable for the Bioconductor project. The current format and mode of distribution is inadequate and will need to change. However I don't have the bandwidth at the moment for this kind of refactoring. So if our server doesn't let me down and I manage to actually produce SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37] then I'll put the tarballs on an egress-free location for you to manually download.
This will be a temporary situation until I find the time to refactor these packages.
Good news is that `extract_snvs_from_RefSNP_json_files.sh` completed :tada: This is by far the most resource intensive step in the pipe. What's surprising is that it took "only" 67h to run, which is fast compared to the 100h it took for dbSNP155 on the same server a couple of years ago. I don't want to call this good news though before I understand the reason behind such a big difference. It could actually hide something bad.
Anyways, going to run the next steps: `select_GRCh38_snvs.sh` + `build_GRCh38_OnDiskLongTable.sh` and `select_GRCh37_snvs.sh` + `build_GRCh37_OnDiskLongTable.sh`. These are expected to take a few hours only...
Oops `select_GRCh38_snvs.sh` fails with dbSNP156 because some rs ids are too big to fit in an integer (e.g. rs2147714790). Switching to use a double vector instead of an integer vector to store the billion or so rs ids. Unfortunately this will make the resulting SNPlocs.Hsapiens.dbSNP156.[GRCh38|GRCh37] packages significantly bigger. :disappointed:
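To illustrate the failure mode, here is a small Python sketch, assuming the ids were being stored as signed 32-bit integers (which is what R's integer vectors are). It checks that the example id no longer fits in that range, and that a 64-bit double still holds it exactly. The helper names are made up for this example and do not come from the forge scripts.

```python
import struct

INT32_MAX = 2**31 - 1   # largest value a signed 32-bit integer can hold
rs_id = 2147714790      # numeric part of rs2147714790, the example id above


def fits_in_int32(value: int) -> bool:
    """True if `value` can be packed as a signed 32-bit integer."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False


def survives_double_round_trip(value: int) -> bool:
    """True if `value` is unchanged after being stored as a 64-bit double."""
    packed = struct.pack("<d", float(value))
    return int(struct.unpack("<d", packed)[0]) == value


print("int32 max:          ", INT32_MAX)
print("rs id:              ", rs_id)
print("fits in int32:      ", fits_in_int32(rs_id))
print("exact as a double:  ", survives_double_round_trip(rs_id))
```

Doubles represent every integer up to 2^53 exactly, so switching storage types leaves plenty of headroom for future rs ids.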
The dbSNP156 packages are ready to go!
Here are some numbers:
Nb of SNPs (i.e. nb of rs ids):
So not a tremendous increase between dbSNP155 and dbSNP156 (only about 4.2%).
Sizes of the source tarballs:
Note that the dbSNP156 packages require a machine with at least 16G of RAM instead of 10G for the dbSNP155 packages. This increase in memory footprint is due to the fact that the rs ids are now stored in a double vector instead of an integer vector (see my previous comment above for why this change was needed). This change also slows down the loading into memory of the rs ids vector (this loading happens the first time one of the `snpsBy*()` functions is called). It also makes the source tarballs slightly bigger.
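As a rough back-of-the-envelope illustration of the footprint change, the sketch below compares the raw size of the rs ids vector at 4 bytes per element (32-bit integer) versus 8 bytes per element (double). The element count of one billion is an assumption taken from the "billion or so rs ids" mentioned earlier, not the exact number of ids in the packages, and the estimate ignores everything else the packages hold in memory.

```python
BYTES_PER_INT32 = 4    # size of one element in a 32-bit integer vector
BYTES_PER_DOUBLE = 8   # size of one element in a double vector

# Assumption: "a billion or so" ids; the exact package count is not used here.
n_ids = 1_000_000_000


def vector_gib(n_elements: int, bytes_per_element: int) -> float:
    """Raw payload size of a vector in GiB, ignoring any object overhead."""
    return n_elements * bytes_per_element / 2**30


int_gib = vector_gib(n_ids, BYTES_PER_INT32)
double_gib = vector_gib(n_ids, BYTES_PER_DOUBLE)

print(f"integer vector: {int_gib:.2f} GiB")
print(f"double vector:  {double_gib:.2f} GiB")
print(f"extra memory:   {double_gib - int_gib:.2f} GiB")
```

Under these assumptions the ids vector alone roughly doubles in size, which points in the same direction as the higher RAM requirement, though it is not meant as a full accounting of it.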
But the craziest number about these new packages is that `R CMD build` takes more than 3 hours to complete despite the fact that the packages have no vignettes! Luckily that doesn't affect the end user, only me :sweat:
I'll move the two packages here soon where they'll be available for download. IMPORTANT: They both require BSgenome >= 1.73.1 which will only become available in Bioconductor 3.20 in the next couple of days.
Thank you Herve! I will add functionality to MungeSumstats so users can supply these to use dbSNP 156
The two packages are finally available at http://149.165.171.124/SNPlocs/
Hello, I would like to ask whether you could create dbSNP154.GRCh38/dbSNP154.GRCh37 packages or provide guidance on how to build them. The GWAS Catalogue uses dbSNP154, so this would be helpful for working on GWAS data.
Many thanks.