Hi @Strexas, sounds like you want to just copy the file?
If your question relates to the API, though: no, you cannot do this, as each pod5 file contains a file UUID which is randomly generated.
That said, re-writing the file using the API will preserve the same (useful) content if you did want to go that route.
Best regards, Rich
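For reference, a minimal sketch of what re-writing through the Python API might look like (this assumes ReadRecord.to_read() and Writer.add_read() as exposed by recent pod5 releases; the copy carries the read data but gets its own random file UUID, so it will not be bit-identical):

import pod5

# Copy every read from one pod5 file into a new one. The read content is
# preserved, but the new file is written with a freshly generated file UUID.
with pod5.Reader("input.pod5") as reader, pod5.Writer("copy.pod5") as writer:
    for record in reader.reads():
        # to_read() turns the on-disk ReadRecord into a writable Read object
        writer.add_read(record.to_read())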
The order of the data could also differ, as the signal chunks may be repacked upon rewriting. So while the exact same important data would be there, there would be differences in the file beyond the newly created UUID.
One of the ways I've validated that pod5 files are identical is to check whether the sorted FASTQ files produced by basecalling them (on the same machine/environment/config) are identical. It has worked well for us so far in all of our testing between various formats and conversions.
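As an illustration of that check, here is a minimal sketch (hypothetical file names; assumes plain or gzipped FASTQ output from the basecaller) that compares two FASTQ files while ignoring record order:

import gzip
import hashlib

def fastq_digest(path):
    # hash the sorted 4-line FASTQ records so record order does not matter
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        lines = handle.read().splitlines()
    records = sorted("\n".join(lines[i:i + 4]) for i in range(0, len(lines), 4))
    return hashlib.sha256("\n".join(records).encode()).hexdigest()

# identical digests suggest the two pod5 files basecalled to the same reads
print(fastq_digest("original.fastq") == fastq_digest("rewritten.fastq"))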
Thank you for the answers. I'll explain my motivation: I noticed that if I read the data about the reads and write it to a binary file using pickle, its total size is 10 times smaller, and if I need to I can load it back and save it as .pod5 using the API. As I understand it, no important data will be lost if I transfer only the reads from one file to another, right?
You may be interested in reading the slow5 paper if you haven't already.
https://www.nature.com/articles/s41587-021-01147-4
BLOW5 is basically a binary TSV with compression on the various columns, much like how SAM/BAM works.
I think you'll find pickle won't be that efficient for large datasets.
@Strexas are you saying that you are only storing the read table (meta) and not the signal table (raw)? If so, it's not surprising the file size is much smaller. What are you trying to achieve? Pickle is a simple stack-based VM for storing program state and is not suitable / safe for long-term data storage.
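One rough way to check whether the pickled data actually contains the raw signal (a sketch, using the signal property mentioned below; read.signal is a numpy array, so nbytes gives its uncompressed size) is to compare the total signal size with the size of the pickle:

import pod5

total_signal_bytes = 0
n_reads = 0
with pod5.Reader("./file.pod5") as reader:
    for read in reader.reads():
        # read.signal is the decompressed raw signal (int16 samples)
        total_signal_bytes += read.signal.nbytes
        n_reads += 1

print(f"{n_reads} reads, ~{total_signal_bytes / 1e6:.1f} MB of uncompressed signal")

If the pickle is far smaller than this, even allowing for compression, it almost certainly does not contain the raw signal.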
I used this code to retrieve the data:

from pod5 import Reader

with Reader('./file.pod5') as reader:
    reads = [read for read in reader.reads()]

I believe I retrieved the raw data, as I can access it with the signal property. Please correct me if I'm wrong.
We do a lot of experiments, but we don't use the data very often. Storing this data even with the cheapest backup solutions is a bit expensive, and I wanted to minimise its size as much as possible without losing any important data.
I'm thinking of switching to Parquet; the final size should be even smaller, and it's more suitable for storing data and faster to read.
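If you do experiment with Parquet, a minimal sketch (hypothetical column layout using pyarrow; note it keeps only read_id and the raw signal, and drops the run info, calibration and other metadata that pod5 stores) might look like:

import pod5
import pyarrow as pa
import pyarrow.parquet as pq

read_ids, signals = [], []
with pod5.Reader("./file.pod5") as reader:
    for read in reader.reads():
        read_ids.append(str(read.read_id))
        signals.append(read.signal)  # int16 numpy array per read

# one row per read: the read id plus its signal as a list<int16> column
table = pa.table({
    "read_id": pa.array(read_ids, type=pa.string()),
    "signal": pa.array(signals, type=pa.list_(pa.int16())),
})
pq.write_table(table, "reads.parquet", compression="zstd")

The signal column dominates the file size, so the result may not end up much smaller than pod5's own compressed signal table, which is the point made in the reply below.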
@Strexas pod5 has been designed specifically for this task and unfortunately there really aren't any major lossless gains to be had.
Thank you for the help.
Hello,
I have a question: is it possible to read data from a file, then create a new file and write the data there, so that I get two files with exactly the same content, where every single bit matches?
Best regards, Dainius