graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components
Apache License 2.0

Chunk ids are not consecutive. #15

Closed: 6br closed this issue 4 years ago

6br commented 4 years ago

Hi, I am running on the SARS dataset: SARS-CoV-2.genbank.20200329.complete.odgi.sorted.w1000.json

I ran python3 matrixcomponent/segmentation.py -j SARS-CoV-2.genbank.20200329.complete.odgi.sorted.w1000.json --cells-per-file 5 -o seg

However, the chunk ids in the output file names are not consecutive.

Saved results to seg/chunk00_bin1000.schematic.json
Saved results to seg/chunk02_bin1000.schematic.json
Saved results to seg/chunk04_bin1000.schematic.json
Saved results to seg/chunk12_bin1000.schematic.json
Saved results to seg/chunk17_bin1000.schematic.json
Saved results to seg/chunk22_bin1000.schematic.json
Saved results to seg/chunk28_bin1000.schematic.json
Saved results to seg/chunk33_bin1000.schematic.json
Saved results to seg/chunk39_bin1000.schematic.json
Saved results to seg/chunk45_bin1000.schematic.json
Saved results to seg/chunk50_bin1000.schematic.json

As shown, the ids 00, 02, 04, 12, 17, 22, 28, 33, 39, 45, 50 are not consecutive.

Should I change the --cells-per-file parameter? If so, how should I decide its value?

6br commented 4 years ago

The data is here: data.zip

subwaystation commented 4 years ago

I ran into similar issues. Please try --cells-per-file 10000. That worked for every data set so far. But I don't know a formula for how to decide the best value.

josiahseaman commented 4 years ago

Non-consecutive chunk filenames are the intended behavior. All filenames are listed in bin2file.json and should be looked up from there, not guessed. --cells-per-file should be used to control the size of the files, not their names. The only constraint on chunk filenames is that they are unique per dataset. They're named by first_bin, but because some components are larger than the chunk step size, you can end up skipping "consecutive" numbers. Using bin2file.json correctly will become even more important when we introduce zoom levels for binning. I found it was much more reliable to have the exact filename listed than to try to guess it from a convention. For example, relying on conventions alone leads to zero-padding width errors.
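To illustrate looking a filename up instead of guessing it, here is a minimal Python sketch. The layout of bin2file.json is not shown in this thread, so the entry format (a list of records with "file", "first_bin" and "last_bin" keys), the bin ranges, and the helper name find_chunk_file are all assumptions made for illustration; only the file names are taken from the log above.

```python
import bisect

# Hypothetical index entries. The real bin2file.json layout may differ,
# and the first_bin/last_bin ranges below are invented for this example.
INDEX = [
    {"file": "chunk00_bin1000.schematic.json", "first_bin": 0, "last_bin": 1},
    {"file": "chunk02_bin1000.schematic.json", "first_bin": 2, "last_bin": 3},
    {"file": "chunk04_bin1000.schematic.json", "first_bin": 4, "last_bin": 11},
    {"file": "chunk12_bin1000.schematic.json", "first_bin": 12, "last_bin": 16},
]


def find_chunk_file(entries, bin_id):
    """Return the name of the file whose bin range contains bin_id."""
    ordered = sorted(entries, key=lambda e: e["first_bin"])
    starts = [e["first_bin"] for e in ordered]
    # Index of the last entry whose first_bin is <= bin_id.
    pos = bisect.bisect_right(starts, bin_id) - 1
    if pos < 0 or bin_id > ordered[pos]["last_bin"]:
        raise KeyError(f"bin {bin_id} is not covered by any listed file")
    return ordered[pos]["file"]


if __name__ == "__main__":
    print(find_chunk_file(INDEX, 5))
```

Because the lookup reads the exact name from the index, it does not depend on the numbering scheme or the zero-padding width of the filenames.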

josiahseaman commented 4 years ago

If we did want to make these consecutive, it'd be as simple as adding a counter on this line: https://github.com/graph-genome/component_segmentation/blob/8c3cbefe6cf6189600c00f2c8769c2392928c0ba/matrixcomponent/PangenomeSchematic.py#L63

This would break the association between chunk number and pangenome position. You can have one or the other, but not both.
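The trade-off can be sketched in a few lines of Python. This is not the code in PangenomeSchematic.py; the two helper functions and the two-digit padding are assumptions chosen to match the file names in the log above, and the first_bin values are read off those file names.

```python
# first_bin of each chunk, taken from the file names in the log above.
FIRST_BINS = [0, 2, 4, 12, 17, 22, 28, 33, 39, 45, 50]


def names_by_first_bin(first_bins, bin_width):
    """Name each chunk by its first bin: numbers track pangenome position."""
    return [f"chunk{fb:02d}_bin{bin_width}.schematic.json" for fb in first_bins]


def names_by_counter(first_bins, bin_width):
    """Name each chunk by a running counter: numbers are consecutive."""
    return [
        f"chunk{i:02d}_bin{bin_width}.schematic.json"
        for i in range(len(first_bins))
    ]


if __name__ == "__main__":
    by_bin = names_by_first_bin(FIRST_BINS, 1000)
    by_count = names_by_counter(FIRST_BINS, 1000)
    for a, b in zip(by_bin, by_count):
        print(a, b)
```

With the counter scheme, the fourth file becomes chunk03 even though it starts at bin 12, so the number no longer says where in the pangenome the chunk begins.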

6br commented 4 years ago

I figured it out. I changed the implementation of graph-genome/pipeline to fetch the file names of the start and end chunks from bin2file.json. However, I still see some unexpected behaviour in Schematize, and I suspect it is due to the non-consecutive chunk filenames. I will open another issue on Schematize.

josiahseaman commented 4 years ago

Thank you Toshiyuki. If you find any place where someone has written code like "chunk" + str(chunk_number +1) + ".json" then that should be purged.

6br commented 4 years ago

Sure, I'll close this because it does not cause a practical problem.