nanoporetech / pod5-file-format

Pod5: a high performance file format for nanopore reads.
https://pod5-file-format.readthedocs.io/

pod5 merge hangs indefinitely at 99-100% (the last 20 pod5 have not been merged) #131

Open kir1to455 opened 3 months ago

kir1to455 commented 3 months ago

Issue Description

I am using pod5 merge to merge 3320 pod5 files. It appears to have stopped before processing the last 20 pod5 files. However, the nohup log reported that the job had finished, with no errors.

Logs

This is the input group: [image]
This is the ip group: [image] [image]
Here is my pod5 merge command: [image]
Here are the sizes of merge_pod5 and multi_pod5: [image] [image]
It seems that the last 20 pod5 files have not been merged.

Specifications

HalfPhoton commented 2 months ago

Interesting. Is this running in a conda environment or a Python virtual environment? We occasionally see issues when running in conda.

Are you able to merge the remaining 20 files into the ip_merge.pod5 file?

kir1to455 commented 2 months ago

Hi, @HalfPhoton

> We occasionally see issues when running in conda.

I am running this in a conda environment. [image]

> Are you able to merge the remaining 20 files into the ip_merge.pod5 file?

I don't know how pod5 merge orders the input files. Is it something like test_0.pod5, test_1.pod5, ..., test_20.pod5? If so, I will try to merge them.

Best wishes, Kirito

HalfPhoton commented 2 months ago

Ah, I see.

In this case, please use pod5 view to list the read ids in all of the inputs and in the first merged output, then compare the two lists to find the read ids that are missing from the merged file.

# get read ids
pod5 view -IH input_data/ -o input.ids
pod5 view -IH merged.pod5 -o merged.ids

# Sort the files (comm requires sorted files)
sort input.ids > input.ids.sorted
sort merged.ids > merged.ids.sorted

# Find ids in input that are not in merged file
comm -23 input.ids.sorted merged.ids.sorted > missing.ids

# Get a pod5 file of only missing ids
pod5 filter input_data/ --ids missing.ids -o missing.pod5

# Merge in missing ids
pod5 merge merged.pod5 missing.pod5 -o merged.final.pod5
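
To confirm that the repair worked, the same id comparison can be repeated against the final file. The following is only a sketch under assumptions: count_missing is a hypothetical helper (it is not part of pod5), and final.ids is an assumed file name. It takes two plain-text id lists, one read id per line as written by pod5 view -IH above, and prints how many ids from the first list are absent from the second.

```shell
# Hypothetical helper (not part of pod5): print the number of ids that are
# in the first list but not in the second. Each file holds one id per line.
count_missing() {
    tmp_dir=$(mktemp -d) || return 1
    # comm requires sorted input; -u also collapses duplicate ids
    sort -u "$1" > "$tmp_dir/expected.sorted"
    sort -u "$2" > "$tmp_dir/actual.sorted"
    # -23 keeps only the lines unique to the first file
    comm -23 "$tmp_dir/expected.sorted" "$tmp_dir/actual.sorted" | wc -l | tr -d ' '
    rm -rf "$tmp_dir"
}

# Assumed usage, reusing the commands from the steps above:
#   pod5 view -IH merged.final.pod5 -o final.ids
#   count_missing input.ids final.ids
```

A result of 0 means every input read id is present in the final file. Any other number means reads are still missing, and the filter and merge steps can be repeated with the new list.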

HalfPhoton commented 2 months ago

I recommend using a Python virtual environment instead of a conda environment:

python3.10 -m venv venv --prompt=pod5
source venv/bin/activate
pip install -U pip pod5
pod5 --version