nanoporetech / pod5-file-format

Pod5: a high performance file format for nanopore reads.
https://pod5-file-format.readthedocs.io/

File reading/writing capabilities #137

Closed Strexas closed 3 months ago

Strexas commented 3 months ago

Hello,

I have a question: is it possible to read data from a file, then create a new file and write the data there, so that the two files have exactly the same content? Every single bit should match.

Best regards, Dainius

HalfPhoton commented 3 months ago

Hi @Strexas, sounds like you want to just copy the file?

HalfPhoton commented 3 months ago

If your question relates to the API though - no, you cannot do this - each pod5 file contains a file UUID which is randomly generated.

Re-writing the file using the API will preserve the same (useful) content, if you did want to go that route though.
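A minimal sketch of that re-write using the Python `pod5` package, under the assumption that `Reader`, `Writer`, and `ReadRecord.to_read()` behave as described in the pod5 docs; the destination file gets a fresh random file UUID, so it will not be bit-identical to the source:

```python
def rewrite_pod5(src_path: str, dst_path: str) -> None:
    """Copy every read record from src_path into a new pod5 file at dst_path.

    The content is preserved, but the new file carries its own random
    file UUID, so a bit-for-bit match with the source is not possible.
    """
    import pod5  # requires: pip install pod5

    with pod5.Reader(src_path) as reader, pod5.Writer(dst_path) as writer:
        for record in reader.reads():
            # ReadRecord.to_read() converts the stored record into a
            # pod5.Read object that the Writer accepts.
            writer.add_read(record.to_read())
```

Usage would just be `rewrite_pod5("./file.pod5", "./copy.pod5")`.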

Best regards, Rich

Psy-Fer commented 3 months ago

The order of the data could also differ, as you may repack the signal chunks upon rewriting. So while the exact same important data would be there, there would be differences in the file beyond the newly created UUID.

One of the ways I've tried validating that pod5 files are identical is checking whether the sorted FASTQ files produced by basecalling them (on the same machine/environment/config) are identical. This has worked well for us so far in all of our testing across various formats and conversions.
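That comparison can be sketched with the standard library alone (the helper name is hypothetical, and this assumes plain four-line FASTQ records, so whole records are sorted rather than raw lines):

```python
import hashlib
from pathlib import Path

def fastq_fingerprint(path: str) -> str:
    """Hash a FASTQ file independent of read order.

    Groups each record's four lines, sorts the records, and digests
    the result, so two basecalls of identical reads compare equal
    even if the reads were emitted in a different order.
    """
    lines = Path(path).read_text().splitlines()
    records = ["\n".join(lines[i:i + 4]) for i in range(0, len(lines), 4)]
    return hashlib.sha256("\n".join(sorted(records)).encode()).hexdigest()

# Two basecalls match (modulo read order) when:
# fastq_fingerprint("a.fastq") == fastq_fingerprint("b.fastq")
```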

Strexas commented 3 months ago

Thank you for the answers. I'll explain my motivation: I noticed that if I read the data about reads and write it to a binary file using pickle, the total size is 10 times smaller. If I need to, I can load it back and save it as .pod5 using the API. As I understand it, no important data will be lost if I transfer only the reads from one file to another, right?

Psy-Fer commented 3 months ago

You may be interested in reading the slow5 paper if you haven't already.

https://www.nature.com/articles/s41587-021-01147-4

Blow5 is basically a binary TSV with compression on the various columns, much like how SAM/BAM works.

I think you'll find pickle won't be that efficient for large datasets.
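A toy standard-library illustration of that point (this is not the pod5/vbz codec; the random-walk "signal" is a stand-in for real nanopore samples): pickling a Python list of ints costs several bytes per value, while packing into a typed array and delta-compressing does much better on slowly varying data.

```python
import array
import pickle
import random
import zlib

# Fake "signal": a slowly varying random walk, loosely imitating raw
# current samples. Stand-in data, seeded for reproducibility.
random.seed(0)
signal, level = [], 500
for _ in range(10_000):
    level += random.randint(-5, 5)
    signal.append(level)

pickled = pickle.dumps(signal)               # Python list of ints
packed = array.array("h", signal).tobytes()  # flat int16, 2 bytes/sample
deltas = array.array(
    "h", [signal[0]] + [b - a for a, b in zip(signal, signal[1:])]
)
compressed = zlib.compress(deltas.tobytes())  # delta + general compressor

print(len(pickled), len(packed), len(compressed))
```

On this data the compressed deltas come out far smaller than both the packed array and the pickle, which is the same idea (specialised columnar compression) that slow5/pod5 apply properly.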

iiSeymour commented 3 months ago

@Strexas are you saying that you are only storing the read table (meta) and not the signal table (raw)? If so, it's not surprising the file size is much smaller. What are you trying to achieve? Pickle is a simple stack-based VM for storing program state and is not suitable/safe for long-term data storage.

Strexas commented 3 months ago

> @Strexas are you saying that you are only storing the read table (meta) and not the signal table (raw)? If so, it's not surprising the file size is much smaller. What are you trying to achieve? Pickle is a simple stack-based VM for storing program state and is not suitable/safe for long-term data storage.

I used this code to retrieve the data:

    # requires the pod5 package: pip install pod5
    from pod5 import Reader

    with Reader('./file.pod5') as reader:
        reads = list(reader.reads())

I believe I retrieved the raw data, as I can access it through the signal property. Please correct me if I'm wrong.

We do a lot of experiments, but we don't use the data very often. Storing this data even with the cheapest backup solutions is a bit expensive, and I wanted to minimise its size as much as possible without losing any important data.

I am thinking of switching to parquet; the final size should be even smaller, and it's more suitable for storing data and faster.

iiSeymour commented 3 months ago

@Strexas pod5 has been designed specifically for this task and unfortunately there really aren't any major lossless gains to be had.

Strexas commented 3 months ago

Thank you for the help.