tyronechen / genomenlp

https://genomenlp.readthedocs.io/en/latest/

MIT License

Patch for _gzip_iterator in tokenise_bio.py #7

Closed stepwise-ai-dev closed 1 year ago

stepwise-ai-dev commented 1 year ago

The _gzip_iterator function in v2.7.1 attempted to manipulate the read object itself, which raised an error. This patch instead extracts the sequence from the read object and applies the requested transformations (case changes, breaking into chunks) directly to the sequence string. This should leverage screed's native support, resolve the error, and restore the function's intended behavior.
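
For illustration, a minimal sketch of the idea rather than the actual patch: pull the sequence string out of each record first, then transform the string. The names `Record`, `transform_sequence` and `iterate_sequences` are hypothetical, and `Record` is only a stand-in for a screed record (assumed here to expose `.name` and `.sequence`). The `case` and `break_size` arguments mirror the `-c` and `-b` options, though the real chunking behaviour in tokenise_bio.py may differ.

```python
from collections import namedtuple

# Hypothetical stand-in for a screed record, so the sketch runs without
# screed installed. Only the `.sequence` attribute is used below.
Record = namedtuple("Record", ["name", "sequence"])


def transform_sequence(record, case=None, break_size=None):
    """Return the record's sequence as a list of transformed strings."""
    # Work on the plain string, never on the record object itself.
    seq = str(record.sequence)
    if case == "upper":
        seq = seq.upper()
    elif case == "lower":
        seq = seq.lower()
    if break_size:
        # Split into consecutive chunks; the last one may be shorter.
        return [seq[i:i + break_size] for i in range(0, len(seq), break_size)]
    return [seq]


def iterate_sequences(records, case=None, break_size=None):
    """Yield transformed sequence strings for every record in turn."""
    for record in records:
        yield from transform_sequence(record, case, break_size)


records = [Record("read1", "acgtACGTac"), Record("read2", "ggCC")]
chunks = list(iterate_sequences(records, case="upper", break_size=4))
print(chunks)  # ['ACGT', 'ACGT', 'AC', 'GGCC']
```

Because the transformations only ever see a `str`, the same code path works whether or not case conversion and chunking are requested together.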

tyronechen commented 1 year ago

Tested, OK:

${script} -i ${infile_path} -v 1000 -t tok_default.json
${script} -i ${infile_path} -v 1000 -t tok_upper.json -c upper
${script} -i ${infile_path} -v 1000 -t tok_lower.json -c lower
${script} -i ${infile_path} -v 1000 -t tok_break.json -b 50
${script} -i ${infile_path} -v 1000 -t tok_upper_break.json -c upper -b 50
${script} -i ${infile_path} -v 1000 -t tok_lower_break.json -c lower -b 50

This fixes the bug introduced in v2.7.1 for Issue #7.