Invalid modkit pileup file

handoko12u commented 3 months ago

Hello @marcpaga

I have a very large modkit pileup file from WGS data; it is 198 GB. So I randomly shuffled it, kept only 10,000,000 lines, and wrote them to a new BED file named subsample.bed.
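For a file this large, the subsample can be drawn in one streaming pass without shuffling the whole file. The following is a minimal sketch of reservoir sampling, not something taken from sturgeon or modkit; the function name, the seed, and the file name in the usage comment are assumptions for illustration.

```python
import random


def reservoir_sample(lines, k, seed=0):
    """Return up to k lines chosen uniformly at random from an iterable.

    Only the k sampled lines are held in memory, so the input can be
    far larger than RAM. The result is not in the original file order.
    """
    rng = random.Random(seed)
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


# Hypothetical usage (the input file name is an assumption):
# with open("pileup.bed") as src, open("subsample.bed", "w") as dst:
#     dst.writelines(reservoir_sample(src, 10_000_000))
```

The sampled lines keep their original content byte for byte, so any delimiter problem in the source file carries over into the subsample.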

I ran sturgeon inputtobed, but it always reports that the BED file is invalid.

I checked my subsample.bed file, and its first few lines are in exactly the format of modkit pileup output.

Why do I get this error?

Another question: sturgeon live takes BAM files as input, so why does sturgeon predict need a BED file that must first be converted with sturgeon inputtobed?

By the way, I tried sturgeon live with the BAM files generated by dorado, but it predicts a completely wrong tumor. Is it because I aligned to hg38? I did include --reference-genome hg38 when I ran sturgeon live.

Please help, thank you

marcpaga commented 3 months ago

Hi @handoko12u,

Can you share the header of the modkit file, with the column names? These are the column names that are expected:

column_names = [
  "chrom",
  "chromStart",
  "chromEnd",
  "mod_code",
  "score_bed",
  "strand",
  "thickStart",
  "thickEnd",
  "color",
  "valid_cov",
  "percent_modified",
  "n_mod",
  "n_canonical",
  "n_othermod",
  "n_delete",
  "n_fail",
  "n_diff",
  "n_nocall"
]

Modkit pileup worked on my end, but perhaps different modkit versions have different output structures.
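One quick way to compare output structures is to count fields per line in two ways: splitting on tabs only, and splitting on any whitespace. The sketch below is only a diagnostic aid and is not part of sturgeon; the function name and the file name in the usage comment are assumptions.

```python
from itertools import islice


def field_counts(lines, n=5):
    """Return (tab_fields, whitespace_fields) for the first n lines.

    If the two numbers differ, the line mixes tabs and spaces as
    delimiters. The whitespace count should match the 18 expected
    column names listed above.
    """
    counts = []
    for line in islice(lines, n):
        line = line.rstrip("\n")
        counts.append((len(line.split("\t")), len(line.split())))
    return counts


# Hypothetical usage (the file name is an assumption):
# with open("subsample.bed") as fh:
#     print(field_counts(fh))
```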

Regarding live, I will deprecate it since it is too complex to maintain. We use modbampy to extract methylation calls, but that library has been deprecated. So it could very well be that dorado encodes methylation differently and the library then gives wrong results, which would lead to an erroneous classification.

It could also be that the prediction is simply wrong. Sturgeon is not correct 100% of the time; we expect a 5% error rate at scores >0.95. Could you please try to classify the sample without live to see if it's truly a wrong classification and not a data processing artifact?

handoko12u commented 3 months ago

subsample.txt

My BED file has no header, but it has 18 columns, exactly the same as yours. I have attached my subsampled BED file. Could you check whether it works on your end?

Thank you

marcpaga commented 3 months ago

Hi @handoko12u,

Can you try with the new version?

The error was due to modkit pileup producing a file in which some columns are separated by tabs and others by spaces.
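If upgrading is not an option, a file with mixed delimiters could in principle be rewritten as purely tab-separated before conversion. This is a minimal sketch under the assumption that no field contains a space; the function name and file names are illustrative and not part of sturgeon or modkit.

```python
def normalize_delimiters(lines):
    """Yield each non-empty line with runs of tabs/spaces replaced by one tab.

    Assumes no individual field contains whitespace.
    """
    for line in lines:
        fields = line.split()  # splits on any run of whitespace
        if fields:
            yield "\t".join(fields) + "\n"


# Hypothetical usage (both file names are assumptions):
# with open("subsample.bed") as src, open("subsample.tabs.bed", "w") as dst:
#     dst.writelines(normalize_delimiters(src))
```

Because the function is a generator, it processes one line at a time and does not load the file into memory.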

handoko12u commented 3 months ago

Hello @marcpaga

Thank you, it works now. I reinstalled my modkit to the newest version.

I am using 64 GB of RAM. When I tried to process a 38 GB BED file, it ran out of memory. When I randomly sampled just 10% of the BED file lines, it was fine. Is there any way to reduce the memory requirement?

Thank you

marcpaga commented 3 months ago

Hi @handoko12u,

I am glad it works now. This program was designed under the constraint that it would be used with little data, as we do in the intraoperative setting, so memory efficiency was not considered at the time. I think there are ways to reduce memory use, but they would require major rewrites of the code. Unfortunately, I do not have the bandwidth to work on this.

I can suggest a workaround, however. You can process the file in chunks and then merge the resulting BED files. A BED file with measurements in all the probes is ~20 MB, so you would end up with at most ~200 MB worth of files. You can use majority voting (or any other cutoff approach) to decide whether a probe is methylated or not, and write that to the final BED file to be used for prediction. Hopefully something like this works if you want to use all that data at once. Alternatively, since you have so much data, you could make an ensemble of predictions from chunks of data, which could make the prediction more robust. I hope this helps.
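The chunk-and-merge idea can be sketched roughly as follows. This is an illustration only: the layout of the per-chunk BED files is not described in this thread, so the sketch assumes each chunk has already been parsed into a dict mapping a probe id to 1 (methylated) or 0 (unmethylated). The function names, the probe ids, and the tie-breaking rule are all assumptions.

```python
from collections import defaultdict
from itertools import islice


def iter_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size items from an iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def majority_vote(chunk_calls, min_fraction=0.5):
    """Merge per-chunk probe calls into one call per probe.

    chunk_calls: iterable of dicts, probe id -> 1 (methylated) or 0.
    A probe missing from a chunk is not counted for that chunk.
    A probe is called methylated only if the fraction of methylated
    calls is strictly greater than min_fraction, so ties become 0.
    """
    methylated = defaultdict(int)
    total = defaultdict(int)
    for calls in chunk_calls:
        for probe, state in calls.items():
            total[probe] += 1
            if state == 1:
                methylated[probe] += 1
    return {p: int(methylated[p] / total[p] > min_fraction) for p in total}
```

For example, with three chunks reporting `{"p1": 1, "p2": 0}`, `{"p1": 1, "p2": 1}`, and `{"p1": 0}`, probe `p1` is methylated in 2 of 3 chunks and is called 1, while `p2` is tied at 1 of 2 and is called 0 under the strict cutoff. Raising `min_fraction` gives a more conservative cutoff than a simple majority.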

handoko12u commented 3 months ago

Okay, thank you so much @marcpaga, will follow your suggestion.

marcpaga / sturgeon

Invalid modkit pileup file #14