jts / nanopolish

Signal-level algorithms for MinION data
MIT License

Segmentation fault in index when including -s #1006

Open franrodalg opened 2 years ago

franrodalg commented 2 years ago

Hi @jts,

I am currently running nanopolish index on two direct RNA libraries, one obtained from a MinION and the other from a PromethION. The former seems to have worked perfectly fine when including the -s option to point at its sequencing_summary.txt file, but the same call was causing the latter to crash, throwing a Segmentation Fault error. This seems to be solved if I remove the -s option.

Due to being a larger run, the guppy basecalling of the PromethION library actually split the sequencing_summary.txt file in two, with the first part being stored as sequencing_summary.txt.prev. I was including only sequencing_summary.txt as part of -s. Could this be the problem? If so, how should I address it?

The documentation for index doesn't seem to provide any information on how to use the -s parameter.

I am using Nanopolish version 0.13.2, installed within a conda environment.

Cheers, Fran

jts commented 2 years ago

Can you check whether the sequencing summary file format is the same for the MinION and PromethION data? Is it possible the split sequencing summary file is truncated (does it have the right number of fields on every line?)
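A quick way to answer the field-count question is to compare every line against the header. The following is a minimal sketch, assuming the summary is tab-separated and that its first line is the header; `find_malformed_lines` is a hypothetical helper name, not part of nanopolish or guppy.

```python
def find_malformed_lines(path, delimiter="\t"):
    """Return (line_number, field_count) for every line whose number of
    fields differs from the header line.

    Assumption: the file is delimiter-separated text with a header on line 1.
    """
    bad = []
    with open(path) as handle:
        header = handle.readline()
        if not header:
            # Empty file: nothing to compare against.
            return bad
        expected = len(header.rstrip("\r\n").split(delimiter))
        for line_number, line in enumerate(handle, start=2):
            count = len(line.rstrip("\r\n").split(delimiter))
            if count != expected:
                bad.append((line_number, count))
    return bad
```

Running it on both the MinION and the PromethION summary files and comparing the results should show whether either file contains short or overlong rows.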

franrodalg commented 2 years ago

The *.prev file does seem to be truncated

$ tail -n 3 [...]/sequencing_summary.txt.prev
PAG50866_pass_a8c5e533_15.fast5 2a66c331-32e9-41c2-b386-eb15743b4918    a8c5e5330539ae4e22df7db35b72c833dbb28282    91  517 2   8131.663000 79.819000   23945   TRUE    8134.889000 22977   76.593000   3394    9.965094    0.000000    76.369461   13.398151   76.369461   13.398151
PAG50866_pass_a8c5e533_15.fast5 71680880-2add-49c9-b327-b00aefb024ec    a8c5e5330539ae4e22df7db35b72c833dbb28282    91  2792    3   7748.188667 56.772000   17031   TRUE    7751.356333 16081   53.604333   2489    8.590801    0.000000    77.709274   13.934077   77.709274   13.934077
PAG50866_pass_a8c5e533_15.fast5

$ tail -n 3 [...]/sequencing_summary.txt
PAG50866_pass_a8c5e533_31.fast5 0c524be7-fb98-4e80-bca2-0008bf76c572    a8c5e5330539ae4e22df7db35b72c833dbb28282    119 2791    3   17481.855333    35.111333   10533   TRUE    17482.014667    10485   34.952000   1784    9.106236    0.000000    87.623909   14.470003   87.623909   14.470003
PAG50866_pass_a8c5e533_31.fast5 11739cd3-6df1-48d2-af3b-4ba0249ce8ef    a8c5e5330539ae4e22df7db35b72c833dbb28282    119 2625    4   17733.792667    71.777333   21533   TRUE    17737.036000    20560   68.534000   3028    10.361650   0.000000    80.656868   13.398151   80.656868   13.398151
PAG50866_pass_a8c5e533_31.fast5 90898a45-9d38-4394-b4cc-e4e4873b546d    a8c5e5330539ae4e22df7db35b72c833dbb28282    119 293 1   17766.737000    56.047000   16814   TRUE    17769.317667    16039   53.466333   1786    9.651551    0.000000    81.460762   15.005929   81.460762   15.005929

but I assume that is normal when guppy times out and gets resumed. Would you recommend deleting the last row and retrying the indexing?

jts commented 2 years ago

If the file is truncated it means guppy terminated abnormally so didn't close the file properly. If you only passed sequencing_summary.txt to nanopolish (and not sequencing_summary.txt.prev) then the truncation shouldn't cause the seg fault (nanopolish would not try to read the .prev file). Can I see the head of sequencing_summary.txt and the summary file from the MinION run?

franrodalg commented 2 years ago

For the PromethION run:

$ head -n 5 [...]/sequencing_summary.txt
filename    read_id run_id  batch_id    channel mux start_time  duration    num_events  passes_filtering    template_start  num_events_template template_duration   sequence_length_template    mean_qscore_template    strand_score_template   median_template mad_template    scaling_median_template scaling_mad_template
PAG50866_fail_a8c5e533_4.fast5  d832e3ea-d137-49a6-9d0a-b4f130341039    a8c5e5330539ae4e22df7db35b72c833dbb28282    0   508 3   16116.210667    5.284333    1585    FALSE   16118.039667    1036    3.455333    131 5.806077    0.000000    88.963722   9.110743    88.963722   9.110743
PAG50866_fail_a8c5e533_4.fast5  571e8c02-5a4e-4370-b735-a2b53c1eab7a    a8c5e5330539ae4e22df7db35b72c833dbb28282    0   1333    2   15261.546000    19.430667   5829    FALSE   15264.969000    4802    16.007667   603 5.504918    0.000000    81.996681   10.986484   81.996681   10.986484
PAG50866_fail_a8c5e533_4.fast5  f784e9eb-d5a9-4282-a81d-3d1cc3667533    a8c5e5330539ae4e22df7db35b72c833dbb28282    0   649 1   15757.121000    10.761000   3228    FALSE   15757.424000    3137    10.458000   458 5.128955    0.000000    77.173347   14.470003   77.173347   14.470003
PAG50866_fail_a8c5e533_4.fast5  c6b78690-436e-438a-8d72-7ce37c4274db    a8c5e5330539ae4e22df7db35b72c833dbb28282    0   697 1   15169.492000    21.781667   6534    FALSE   15169.802667    6441    21.471000   1042    5.794460    0.000000    71.010201   11.254447   71.010201   11.254447

and for the MinION run:

$ head -n 5 [...]/sequencing_summary.txt 
filename    read_id run_id  batch_id    channel mux start_time  duration    num_events  passes_filtering    template_start  num_events_template template_duration   sequence_length_template    mean_qscore_template    strand_score_template   median_template mad_template    scaling_median_template scaling_mad_template
FAO33153_fail_1ddb64de_28.fast5 e15c1fb1-5586-472e-ba58-8d08c3e1093c    1ddb64de2c69221c6b30bbf05c892f79f2199f1d    0   468 1   11924.139774    64.340637   19379   TRUE    11925.856906    18862   62.623506   3530    7.233658    0.000000    85.713989   12.361288   85.713989   12.361288
FAO33153_fail_1ddb64de_28.fast5 9a5f2bba-0da6-49c4-bfc0-6001427ad246    1ddb64de2c69221c6b30bbf05c892f79f2199f1d    0   452 2   12186.254980    2.010956    605 FALSE   12186.254980    605 2.010956    112 6.817979    0.000000    118.179344  15.078054   118.179344  15.078054
FAO33153_fail_1ddb64de_28.fast5 dfbdca83-b41d-4e8b-97cf-2b008877a465    1ddb64de2c69221c6b30bbf05c892f79f2199f1d    0   327 3   12146.663347    8.406707    2532    FALSE   12150.514940    1372    4.555113    99  5.765373    0.000000    68.055000   9.508683    68.055000   9.508683
FAO33153_fail_1ddb64de_28.fast5 ed6738ea-1371-48c8-b7e6-30aaacc40e17    1ddb64de2c69221c6b30bbf05c892f79f2199f1d    0   204 3   12069.395086    3.739044    1126    FALSE   12069.395086    1126    3.739044    125 5.432938    0.000000    82.725540   11.410419   82.725540   11.410419

If nanopolish would not try to read the *.prev file, would that mean that many reads would be ignored? Is it possible to include multiple files within -s?

jts commented 2 years ago

Unfortunately you can't include multiple summary files, but you can merge them into one file, then provide that. I think you can simply cat the files (after removing the truncated lines from .prev) - nanopolish will ignore the redundant headers.
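For anyone who prefers to script the merge rather than edit the files by hand, here is a rough sketch. It assumes tab-separated files that share the same header, and it treats any row with a different field count from the header as truncated. `merge_summaries` and the output filename in the usage note are hypothetical names chosen for illustration.

```python
def merge_summaries(paths, out_path, delimiter="\t"):
    """Concatenate summary files into out_path, writing the header once
    and skipping rows whose field count differs from the header.

    Returns the number of rows skipped as malformed.
    Assumption: all input files share the header of the first non-empty file.
    """
    header = None
    expected = 0
    skipped = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\r\n")
                    if header is None:
                        # First line seen overall is taken as the header.
                        header = line
                        expected = len(line.split(delimiter))
                        out.write(line + "\n")
                    elif line == header:
                        # Repeated header from a later file.
                        continue
                    elif len(line.split(delimiter)) != expected:
                        skipped += 1
                    else:
                        out.write(line + "\n")
    return skipped
```

For the case in this thread, that would be something like `merge_summaries(["sequencing_summary.txt.prev", "sequencing_summary.txt"], "sequencing_summary.merged.txt")`, with the merged file then passed to -s.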

"If nanopolish would not try to read the *.prev file, would that mean that many reads would be ignored?"

No, it will revert to indexing the fast5s not present in the summary the slow way (opening up each fast5 to see what reads are contained within). The summary file is only used as a hint to accelerate the indexing.
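To get a rough idea of how many files would take the slow path, you can compare the fast5 files on disk with the `filename` column of the summary. This is only a file-level estimate and not a description of nanopolish's internal logic. It assumes a tab-separated summary with a `filename` column, as in the headers shown above; `fast5_missing_from_summary` is a hypothetical helper.

```python
import os


def fast5_missing_from_summary(summary_path, fast5_dir, delimiter="\t"):
    """Return the sorted names of .fast5 files under fast5_dir that are
    not listed in the 'filename' column of the summary file."""
    listed = set()
    with open(summary_path) as handle:
        header = handle.readline().rstrip("\r\n").split(delimiter)
        column = header.index("filename")  # raises ValueError if absent
        for line in handle:
            fields = line.rstrip("\r\n").split(delimiter)
            if len(fields) > column:
                listed.add(fields[column])

    on_disk = set()
    for _root, _dirs, files in os.walk(fast5_dir):
        on_disk.update(name for name in files if name.endswith(".fast5"))

    return sorted(on_disk - listed)
```

A long list here would suggest that much of the run is only described in the .prev file, in which case merging the two summaries first should make indexing faster.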

franrodalg commented 2 years ago

Thanks, Jared.

So if the file that was actually used appeared correct, do you have any suggestion as to why it would have crashed? Is there any other test you want me to try?