pachterlab / seqspec

machine-readable file format for genomic library sequence and structure
MIT License

using region.max_len for format and check is a problem for nanopore reads #47

Closed detrout closed 3 months ago

detrout commented 4 months ago

It's unclear what the max length for a nanopore seqspec read should be.

I tried using 2 million, as that's the reported maximum; however, this causes problems for seqspec check and seqspec format, and for keeping the YAML files reasonably sized.

Currently the checks require that the sequence length match max_len as a fixed string. Needless to say, with a 2 million base pair max_len this leads to a 4 megabyte YAML file.

Some options might be to use the min length for sequence lengths, or to implement some kind of run-length encoding for the sequence strings instead of a massive list of Xs. Perhaps the sequence string could use something like X{2000000} instead.
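
A minimal sketch of what the X{2000000} idea could look like, for illustration only. The notation and the helper names (expand_runs, encoded_length, compress_runs) are hypothetical and are not part of seqspec; the sketch assumes a run is written as a single letter followed by a count in braces.

```python
import re

# Hypothetical notation: one letter followed by {count}, e.g. "X{2000000}".
_RUN = re.compile(r"([A-Za-z])\{(\d+)\}")


def expand_runs(encoded: str) -> str:
    """Expand 'ACGX{5}' into 'ACGXXXXX'; plain characters pass through."""
    return _RUN.sub(lambda m: m.group(1) * int(m.group(2)), encoded)


def encoded_length(encoded: str) -> int:
    """Length of the expanded sequence, computed without building it."""
    total = 0
    pos = 0
    for m in _RUN.finditer(encoded):
        total += m.start() - pos      # literal characters before this run
        total += int(m.group(2))      # characters represented by the run
        pos = m.end()
    return total + (len(encoded) - pos)  # trailing literal characters


def compress_runs(sequence: str, min_run: int = 4) -> str:
    """Replace runs of at least min_run identical characters with c{n}."""
    out = []
    for m in re.finditer(r"(.)\1*", sequence, flags=re.DOTALL):
        run = m.group(0)
        if len(run) >= min_run:
            out.append(f"{run[0]}{{{len(run)}}}")
        else:
            out.append(run)
    return "".join(out)
```

With this kind of encoding a length check could compare encoded_length(sequence) against max_len without the YAML ever holding millions of literal Xs.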

sbooeshaghi commented 3 months ago

Thank you for the suggestion to use the min len. I have changed the code accordingly.

https://github.com/pachterlab/seqspec/commit/7c1a6a58e16e0f49eb95b37e3cbab369357d6414
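
As a rough illustration only, and not taken from the linked commit: the min-length approach could be sketched as below. The function names are hypothetical, and the region is assumed to be a plain dict with sequence, min_len, and max_len keys; the actual seqspec code may differ.

```python
def placeholder_sequence(region: dict, fill: str = "X") -> str:
    """Build a placeholder sequence sized by min_len rather than max_len."""
    return fill * region["min_len"]


def check_sequence_length(region: dict) -> list[str]:
    """Return a list of error messages; an empty list means the region passes."""
    errors = []
    if region["min_len"] > region["max_len"]:
        errors.append("min_len is greater than max_len")
    seq_len = len(region["sequence"])
    # Accept any length inside [min_len, max_len] instead of requiring
    # the sequence to be exactly max_len characters long.
    if not (region["min_len"] <= seq_len <= region["max_len"]):
        errors.append(
            f"sequence length {seq_len} is outside "
            f"[{region['min_len']}, {region['max_len']}]"
        )
    return errors
```

Under these assumptions, a nanopore region with min_len 1 and max_len 2000000 would carry a one-character placeholder and still pass the check.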