skovaka / uncalled4

MIT License
42 stars 4 forks source link

Refstats usage #23

Open physnano opened 3 months ago

physnano commented 3 months ago

Hi Sam,

Thanks for developing uncalled4 (much needed!). I am trying to obtain reference level statistics (KS comparison between two samples processed with uncalled4 align and I am running into some issues. I am working off the dev branch for RNA004 chemistry. I was able to generate aligned bams via:

uncalled4 align ${refPath} ${pod5Path} --bam-in ${bamIn} -p 6 --flowcell FLO-MIN004RA --kit SQK-RNA004 -o ${bamOut}'_align.bam'

However I cannot seem to figure out the refstats call. Can you advise on the formatting of the following refstats command?:

uncalled4 refstats -c -o out.dat dtw.current,dtw.dwell ks,mean,stdv ${bam1In} ${bam2In}

The errors I am seeing are an extensive list of the following flavor (shortened for breivity) but terminating in a "bus error" sometimes with different numbers preceeding:

  refstats = groups.agg(stats.layer_agg).reset_index().pivot(index=["seq.pos","seq.fwd"], columns="aln.track")
<<<<<PATH>>>>>/.conda/envs/uncalled4_dev/lib/python3.12/site-packages/uncalled4/tracks.py:1520: FutureWarning: The provided callable <function std at 0x2aaab514d120> is currently using SeriesGroupBy.std. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "std" instead.
  refstats = groups.agg(stats.layer_agg).reset_index().pivot(index=["seq.pos","seq.fwd"], columns="aln.track")
/var/spool/slurm/job37556276/slurm_script: line 25:  1220 Bus error               uncalled4 refstats -c -o out.dat dtw.current,dtw.dwell mean,stdv ${bam1In} ${bam2In}

Thank you

-Will

skovaka commented 3 months ago

It's possible the "bus error" could be caused by too much memory usage. Is anything output to out.dat, or does it crash before outputting? Also, do simpler versions work, like ...dtw.current mean ${bam1In}, ...dtw.current ks ${bam1In} ${bam2In}, etc? Let me know if any combinations work, or if they produce any output before failing

Could you also save the full error log (2> out.err) and attach it? It looks like mostly pandas version-related warnings which I can try to clean up

physnano commented 3 months ago

I'm trying a simpler version now (one bam input, one layer, one refstats), will let you know how it goes. A previous run I did see the header output, but no data. For this recent "bus error" the output file was empty. Here is the output (.err): out.txt

physnano commented 2 months ago

OK so the issue is definitely a memory usage issue. I retried with a smaller dataset input and get the following output: out3.txt.

As for the formatting of output, is it possible for the "ref" column to reflect the position within the given reference (column 1) rather than a position continuation (see transition from refA line to refB ~ line 708 and refB to refC transition, line ~3766). In other words, should the "ref" value/position (second column) continually increase across multiple different ref_names (column 1) or reset once there is a new ref_name?

skovaka commented 2 months ago

I just pushed some updates to the dev branch which cleaned up some refstats warnings, removed redundant "coverage" columns, and possibly reduced memory usage.

The "ref" column coordinate issue looks like a bug, but I wasn't able to reproduce it. I had recently made some changes to coordinate indexing, and it's possible that I accidentally pushed a broken version to dev that I later fixed after you pulled. Let me know if the coordinates are still wrong and I can take another look.

physnano commented 2 months ago

Hi Sam, I was able to pull uncalled4 again and generate the required output using refstats but it seems the ref column coordinate indexing is still continuing through the different ref_names

physnano commented 2 months ago

Also I am now attempting to run the following: uncalled4 refstats -c -p 6 -o out_ks.dat dtw.current ks ${bam1In} ${bam2In}

And I receive the following error:

Traceback (most recent call last):
  File "/blank/home/user/.conda/envs/uncalled4_dev/bin/uncalled4", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/blank/home/user/.conda/envs/uncalled4_dev/lib/python3.12/site-packages/uncalled4/__main__.py", line 390, in main
    ret = fn(conf=conf)
          ^^^^^^^^^^^^^
  File "/blank/home/user/.conda/envs/uncalled4_dev/lib/python3.12/site-packages/uncalled4/stats/refstats.py", line 69, in refstats
    stats = chunk.calc_refstats()
            ^^^^^^^^^^^^^^^^^^^^^
  File "/blank/home/user/.conda/envs/uncalled4_dev/lib/python3.12/site-packages/uncalled4/tracks.py", line 1464, in calc_refstats
    .pivot(index=["seq.pos","seq.fwd"], columns="aln.track")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/blank/home/user/.conda/envs/uncalled4_dev/lib/python3.12/site-packages/pandas/core/frame.py", line 9339, in pivot
    return pivot(self, index=index, columns=columns, values=values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/blank/home/user/.conda/envs/uncalled4_dev/lib/python3.12/site-packages/pandas/core/reshape/pivot.py", line 537, in pivot
    indexed = data.set_index(
              ^^^^^^^^^^^^^^^
  File "/blank/home/user/.conda/envs/uncalled4_dev/lib/python3.12/site-packages/pandas/core/frame.py", line 6122, in set_index
    raise KeyError(f"None of {missing} are in the columns")
KeyError: "None of ['seq.pos', 'seq.fwd', 'aln.track'] are in the columns"
skovaka commented 2 months ago

Sorry for the delay, I fixed your most recent None of ... are in the columns error which occurred when you only compute "dtw.current ks" statistics, and pushed the fix to the dev branch.

Unfortunately I still haven't been able to reproduce your reference coordinate issue. Which sequencing chemistries have you tested (e.g. RNA004, r10.4.1 DNA?), and does the error occur if you downsample reads to < ~50x coverage? Every dataset I've tried seems to work properly. If you can share a small Uncalled4 aligned BAM and the corresponding reference FASTA which reproduces the error, that would be very helpful. If you prefer to share privately you could email it to skovaka1 jhu.edu

skovaka commented 2 months ago

I should also mention: try running pip uninstall uncalled4 and remove the build/ directory if present in the uncalled4 repo before re-installation, just to make sure you fully re-compile the code