skinniderlab / CLM

MIT License

support for gz files throughout the pipeline; more tests #159

Closed vineetbansal closed 3 months ago

vineetbansal commented 3 months ago

Some tweaks and associated tests for read_csv_file/write_to_csv_file so that they work across all combinations of compression (plain vs. gzip), input type (iterable vs. dataframe), and header presence. If I've missed any testing scenarios, feel free to add them.

I've modified the Snakefile so that all *.csv files are now *.csv.gz, and everything seems to work correctly (except for issue #158, so that file hasn't been modified).
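The Snakefile change amounts to renaming declared outputs; a hypothetical rule (not an actual CLM rule) shows the shape. Since the I/O helpers pick compression from the extension, only the filenames in the rule need to change:

```
# Illustrative Snakemake rule; rule and file names are made up.
rule sample_molecules:
    input:
        "data/{dataset}/input.csv.gz"
    output:
        "data/{dataset}/sampled.csv.gz"
    shell:
        "python sample.py --input {input} --output {output}"
```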

The final checksum in test_snakemake.py remains the same; only the way we obtain it has changed. In a future commit we won't need to decode/encode the contents of the file being checksummed, since we can checksum it directly (the checksum value will of course change when we do that).

I don't think the steps in Snakefile_model_eval can handle compressed files yet; that change is coming in the next commit.

vineetbansal commented 3 months ago

EDIT: After making the last commit to the Snakefile_model_eval file (along with a tiny one-line change to the code that counts the total number of rows), I've verified that it does work correctly, i.e. it runs to completion without errors, when operating on the pipeline output from test_snakemake.py (with 10k sample molecules instead of 100) and with the changes from PR #160 incorporated.