nlapier2 / Metalign

Metalign: efficient alignment-based metagenomic profiling via containment min hash
MIT License

Building the Metalign database from different sources #22

Open nlapier2 opened 4 years ago

nlapier2 commented 4 years ago

It would be nice to be able to re-build the Metalign database in the same way that it was originally built (although it would be different, and larger, due to more genome assemblies being added to NCBI). A script exists that sort-of does this (given that all NCBI genome assemblies have already been grabbed via rsync), but it needs to be updated and polished: https://github.com/nlapier2/Metalign/blob/master/utils/ncbi2db.py

Also, it would be nice to have wrappers for other resources such as JGI, and to make it easier for users to build custom databases.

dkoslicki commented 4 years ago

The following gives an example of how to re-train everything and then make a simple mock community to test Metalign with the new training data. This is mostly for my future reference as it's integrated. See the example here.

Note this is run in a subdirectory of Metalign (e.g. Metalign/local_tests) and requires the following simple script.

vrou1995 commented 3 years ago

Hi,

I tried to retrain the model but ran into an issue at the k-mer dumping step. The error said that it could not "import MinHash as MH", even though I have cmash installed in my Conda "metalign" environment. Is there perhaps an issue with the sys.path?

Many thanks,

Vincent

dkoslicki commented 3 years ago

@vrou1995 Python imports and sys.path are the bane of my existence. I usually solve it with a bunch of sys.path.insert calls wrapped in try/except. See here for an example. It might also be an issue of where you are calling the script from, but without knowing your path, it’s hard to diagnose. Hopefully that points you in the right direction; otherwise, LMK.
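
For anyone hitting the same error, here is a minimal sketch of the sys.path.insert plus try/except pattern. It is not the exact code from the Metalign or CMash scripts: the helper name import_with_fallback and the directory in the usage comment are hypothetical and would need to be adapted to wherever MinHash.py lives on your machine.

```python
import importlib
import os
import sys


def import_with_fallback(module_name, candidate_dirs):
    """Import module_name; if that fails, retry after adding each candidate dir to sys.path.

    Hypothetical helper for illustration only, not part of Metalign or CMash.
    """
    # First attempt: rely on the environment as it is (e.g. the activated Conda env).
    try:
        return importlib.import_module(module_name)
    except ImportError:
        pass

    # Fallback: prepend each existing candidate directory and try again.
    for directory in candidate_dirs:
        directory = os.path.abspath(directory)
        if not os.path.isdir(directory):
            continue  # skip paths that do not exist on this machine
        if directory not in sys.path:
            sys.path.insert(0, directory)
        importlib.invalidate_caches()  # make sure newly added dirs are rescanned
        try:
            return importlib.import_module(module_name)
        except ImportError:
            continue

    raise ImportError(
        f"Could not import {module_name!r}; tried sys.path plus {list(candidate_dirs)}"
    )


# Usage (the directory below is an assumed location, adjust to your checkout):
# MH = import_with_fallback("MinHash", ["../CMash/CMash"])
```

Relative paths in candidate_dirs are resolved against the current working directory, which is why the directory you call the script from can change whether the import succeeds. Printing sys.path just before the failing import is a quick way to check which directories are actually being searched.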

vrou1995 commented 3 years ago

Hi @dkoslicki, I'm still getting stuck at the prefilter step. I have a server with 64GB RAM, but it's been running for many weeks. Do you have any idea why this might be?