Closed by kevinkle 6 years ago
[claing@superphy enterobase_db-fixed]$ ls -1 | wc -l
47315
[claing@superphy enterobase_db-fixed]$ pwd
/Warehouse/Users/claing/enterobase_db-fixed
>>> import os
>>> directory = '/Warehouse/Users/claing/enterobase_db-fixed'
>>> list_files = []
>>> for root, dirs, files in os.walk(os.path.abspath(directory)):
...     for file in files:
...         if os.path.splitext(file)[1] in ('.fna', '.fasta'):
...             list_files.append(os.path.join(root, file))
...
>>> len(list_files)
47315
Going to chunk this into 7 batches (47315 / 7 rounds up to 6760 genomes per batch; the last batch gets the remaining 6755).
>>> chunk(6760, list_files, '/home/claing/chunk/')
[claing@superphy chunk]$ ls
batch_0_6760.p batch_20280_27040.p batch_33800_40560.p batch_6760_13520.p
batch_13520_20280.p batch_27040_33800.p batch_40560_47315.p
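The chunk helper called above is not shown in this thread. A minimal sketch that is consistent with the call signature and with the batch_<start>_<end>.p file names listed above might look like the following. It is written for Python 3 (the session above is Python 2), and the body is an assumption, not the actual implementation:

```python
import os
import pickle


def chunk(size, paths, out_dir):
    """Write consecutive slices of `paths` to pickles named batch_<start>_<end>.p.

    Hypothetical reconstruction: only the name, the argument order, and the
    output file names are taken from the thread.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for start in range(0, len(paths), size):
        # The final batch may be shorter than `size`.
        end = min(start + size, len(paths))
        out_path = os.path.join(out_dir, 'batch_{}_{}.p'.format(start, end))
        with open(out_path, 'wb') as fh:
            pickle.dump(paths[start:end], fh)
        written.append(out_path)
    return written
```

With 47315 paths and a size of 6760, this writes seven pickles, the last one covering indices 40560 to 47315.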
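>>> def move(p, dst):
...     # Copies (does not move) every file listed in the pickled batch p into dst.
...     import shutil
...     import cPickle as pickle  # Python 2; on Python 3 use `import pickle`
...     with open(p, 'rb') as fh:
...         l = pickle.load(fh)
...     for f in l:
...         shutil.copy(f, dst)
...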
>>> move('/home/claing/chunk/batch_0_6760.p','/docker/chunk/')
Need to add a check for Enterobase files with spaces in the name: https://sentry.io/share/issue/d2755f84948b420e97f0d52568b8c02b/
Using rename ' ' '_' * in bash for now.
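One caveat with the bash workaround: if that is the util-linux rename, it replaces only the first occurrence of the pattern in each file name, so names with several spaces need more than one pass. A rough Python 3 sketch that replaces every space in one go is below. The function name replace_spaces is made up for illustration, and this is not the check that was later merged into the backend:

```python
import os


def replace_spaces(directory):
    """Rename files in `directory` so every space in the name becomes '_'.

    Hypothetical helper for illustration only. Returns (old, new) name pairs.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        src = os.path.join(directory, name)
        if ' ' not in name or not os.path.isfile(src):
            continue
        new_name = name.replace(' ', '_')
        dst = os.path.join(directory, new_name)
        if os.path.exists(dst):
            # Do not overwrite an existing file; leave it for manual review.
            continue
        os.rename(src, dst)
        renamed.append((name, new_name))
    return renamed
```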
Fixed and merged in https://github.com/superphy/backend/pull/248/
Making some edits to this to fix some errors and for ease of use.
Will likely have to chunk the files for disk space considerations.
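For the disk space concern, one option is to check how much room a batch needs before copying it. The sketch below assumes a batch pickle that holds a list of file paths, as in the move step, and uses a made-up function name batch_fits. It is written for Python 3 and is only an illustration:

```python
import os
import pickle
import shutil


def batch_fits(batch_pickle, dst):
    """Return (bytes needed, bytes free at dst, whether the batch fits).

    Hypothetical helper: assumes the pickle holds a list of file paths.
    """
    with open(batch_pickle, 'rb') as fh:
        paths = pickle.load(fh)
    # Total size of the files that would be copied.
    needed = sum(os.path.getsize(p) for p in paths)
    # Free space on the filesystem that contains dst.
    free = shutil.disk_usage(dst).free
    return needed, free, needed <= free
```

If a batch does not fit, a smaller chunk size can be used, or the previous batch can be cleared from the destination first.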