psychoinformatics-de / ClusteredNetwork_pub

Have source dataset with archivable file format #8

Open mih opened 4 months ago

mih commented 4 months ago

Here is a log. This could have been done simultaneously with #7, but we do not yet know whether we can actually switch to this format.

# another dedicated dataset, again no annex because data
# are small and public
❯ datalad create --no-annex joe_and_lili
❯ cd joe_and_lili
# link dataset with data files in pickle format
❯ datalad clone -d . https://github.com/psychoinformatics-de/joe_and_lili_pickle.git source/joe_and_lili_pickle
❯ mkdir code
# invent little conversion helper
❯ cat << EOT > code/convert.py
from numpy import savez
import pandas as pd
import sys

# barf whenever there are not exactly two arguments
infile, outfile = sys.argv[1:]

# read the pickle
df = pd.read_pickle(infile)
# write Numpy's stable/simple npz format
savez(outfile, **df)
EOT
# save script in dataset so we know exactly what ran
❯ datalad save -m "Script to convert from pickle to Numpy's NPZ format"
# run conversion, capture provenance
❯ datalad run -m "Convert data to NPZ format" -i code -i source/joe_and_lili_pickle/data -o data sh -c 'mkdir -p data && for f in source/joe_and_lili_pickle/data/*.pickle; do echo $f; python code/convert.py "$f" "data/$(basename ${{f%.pickle}}).npz"; done'
# run conversion on the `toc` file, which uses a different
# naming scheme; the procedure is otherwise the same, but
# the script needs a slight adjustment
❯ cat << EOT > code/convert_toc.py
from numpy import savez
import pandas as pd
import sys

# barf whenever there are not exactly two arguments
infile, outfile = sys.argv[1:]

# read the pickle
df = pd.read_pickle(infile)
# we need to rename the 'file' key for compatibility with savez()
# we also recode the filenames to match the new format, and
# convert to UTF8 strings
df['files'] = [f'{f[:-7].decode()}.npz' for f in df.pop('file')]
# write Numpy's stable/simple npz format
savez(outfile, **df)
EOT
❯ datalad save -m "Script to convert TOC from pickle to Numpy's NPZ format" code
❯ datalad run -m "Convert 'toc' to NPZ format" -i code -i source/joe_and_lili_pickle/data/toc -o data/toc.npz sh -c 'mkdir -p data && python code/convert_toc.py "{inputs[1]}" "{outputs}"'
❯ datalad run -i source/joe_and_lili_pickle/LICENSE -o LICENSE 'sh -c "cp -LRv {inputs} {outputs}"'

❯ git remote add origin git@github.com:psychoinformatics-de/joe_and_lili.git
❯ git branch -M main
❯ git push -u origin main

The outcome is at https://github.com/psychoinformatics-de/joe_and_lili
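
For reference, a minimal sketch of why the `toc` conversion renames the `file` key, and of how a converted file can be read back. The `toc` dict below is a made-up stand-in (its keys, filenames, and values are assumptions, not the real data), and it writes to an in-memory buffer rather than to `data/toc.npz`.

```python
import io

import numpy as np

# Hypothetical stand-in for the unpickled 'toc' content: a mapping with a
# 'file' key holding byte-string filenames. Names and values are invented.
toc = {
    "file": [b"joe001.pickle", b"lili002.pickle"],
    "trial": np.array([1, 2]),
}

# np.savez() takes the output target as its first parameter, named 'file'.
# Expanding a mapping that also has a 'file' key therefore collides with it.
try:
    np.savez(io.BytesIO(), **toc)
    collided = False
except TypeError:
    collided = True

# Rename the key and recode the filenames, mirroring code/convert_toc.py:
# strip the 7-character '.pickle' suffix, decode to str, append '.npz'.
converted = dict(toc)
converted["files"] = [f"{f[:-7].decode()}.npz" for f in converted.pop("file")]

buffer = io.BytesIO()
np.savez(buffer, **converted)
buffer.seek(0)

# Plain string and numeric arrays load without pickle support enabled.
with np.load(buffer, allow_pickle=False) as npz:
    files = npz["files"].tolist()
    trials = npz["trial"].tolist()

print(collided, files, trials)
```

Loading with `allow_pickle=False` is a quick check that the result no longer depends on pickle, which is the point of the conversion.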