superphy / spfy

Spfy: an integrated graph database for real-time prediction of Escherichia coli phenotypes and downstream comparative analyses
https://lfz.corefacility.ca/superphy/grouch/
Apache License 2.0

update sideload.py to process enterobase set on corefacility #240

Closed kevinkle closed 6 years ago

kevinkle commented 7 years ago

Will likely have to chunk the files because of disk space constraints.

kevinkle commented 7 years ago
[claing@superphy enterobase_db-fixed]$ ls -1 | wc -l
47315
[claing@superphy enterobase_db-fixed]$ pwd
/Warehouse/Users/claing/enterobase_db-fixed
kevinkle commented 7 years ago
>>> import os
>>> directory = '/Warehouse/Users/claing/enterobase_db-fixed'
>>> list_files = []
>>> for root, dirs, files in os.walk(os.path.abspath(directory)):
...     for file in files:
...         # keep only FASTA genome files
...         if os.path.splitext(file)[1] in ('.fna', '.fasta'):
...             list_files.append(os.path.join(root, file))
...
>>> len(list_files)
47315
kevinkle commented 7 years ago

Going to split this into 7 chunks (6760 genomes per batch, with the remaining 6755 in the last batch).

kevinkle commented 7 years ago
>>> chunk(6760, list_files, '/home/claing/chunk/')
[claing@superphy chunk]$ ls
batch_0_6760.p       batch_20280_27040.p  batch_33800_40560.p  batch_6760_13520.p
batch_13520_20280.p  batch_27040_33800.p  batch_40560_47315.p
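
The body of chunk() is not shown in this thread. A minimal sketch of what such a helper might look like, assuming it pickles consecutive slices of the path list and names each file batch_<start>_<end>.p as in the listing above (the parameter names and the return value are illustrative, not taken from sideload.py):

```python
import os
import pickle


def chunk(size, paths, out_dir):
    """Pickle `paths` in consecutive slices of at most `size` items.

    Hypothetical reconstruction: the batch_<start>_<end>.p naming is
    inferred from the directory listing, not from the real implementation.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for start in range(0, len(paths), size):
        # The final slice may be shorter than `size`.
        end = min(start + size, len(paths))
        out_path = os.path.join(out_dir, 'batch_{}_{}.p'.format(start, end))
        with open(out_path, 'wb') as fh:
            pickle.dump(paths[start:end], fh)
        written.append(out_path)
    return written
```

With 47315 paths and a size of 6760 this would write seven pickles, the last one covering indices 40560 to 47315. The sketch targets Python 3; the session in this thread uses cPickle, so the original presumably ran on Python 2.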
kevinkle commented 7 years ago
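>>> def move(p, dst):
...     # Copies (despite the name) every file listed in pickle `p` into `dst`.
...     import shutil
...     import cPickle as pickle  # Python 2; use `import pickle` on Python 3
...     with open(p, 'rb') as fh:
...         l = pickle.load(fh)
...     for f in l:
...         shutil.copy(f, dst)
...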
>>> move('/home/claing/chunk/batch_0_6760.p','/docker/chunk/')
kevinkle commented 7 years ago

Need to add a check for EnteroBase files with spaces in the name: https://sentry.io/share/issue/d2755f84948b420e97f0d52568b8c02b/

Using rename ' ' '_' * in bash for now
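
The same workaround could be done from Python before the files are copied. A rough sketch, assuming every space in a file name should become an underscore and that an existing target file should never be overwritten (replace_spaces is a hypothetical helper, not part of sideload.py):

```python
import os


def replace_spaces(directory, replacement='_'):
    """Rename regular files in `directory` whose names contain spaces.

    Hypothetical helper for illustration. Returns (old, new) name pairs.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        src = os.path.join(directory, name)
        if ' ' not in name or not os.path.isfile(src):
            continue
        new_name = name.replace(' ', replacement)
        dst = os.path.join(directory, new_name)
        if os.path.exists(dst):
            # Skip rather than clobber a file that already has this name.
            continue
        os.rename(src, dst)
        renamed.append((name, new_name))
    return renamed
```

One reason to prefer this: str.replace substitutes every space in the name. If the rename on the server is the util-linux one, it replaces only the first occurrence per run, as far as I know, so names with several spaces would need repeated passes.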

kevinkle commented 7 years ago

Fixed and merged in https://github.com/superphy/backend/pull/248/

kevinkle commented 6 years ago

Making some edits to this to fix some errors and for ease of use.

kevinkle commented 6 years ago

Up as of https://github.com/superphy/backend/commit/ed241cc8ed9088521a6dafefc37dd93343b11e2d. Closing.