Closed. 6br closed this issue 4 years ago.
I ran into similar issues. Please try --cells-per-file 10000. That has worked for every dataset so far, but I don't know a formula for deciding the best value.
Non-consecutive chunk filenames are the intended behavior. All filenames are listed in bin2file.json and should be looked up from there, not guessed. --cells-per-file should be used to control the size of the files, not their names. The only constraint on chunk filenames is that they are unique per dataset. They're named by first_bin, but because some components are larger than the chunk step size, you can end up skipping "consecutive" numbers. Using bin2file.json correctly will become even more important when we introduce zoom levels for binning. I found it much more reliable to have the exact filename listed than to try to guess it from a convention. For example, you run into zero-padding width errors when relying on conventions alone.
If we did want to make these consecutive, it'd be as simple as adding a counter on this line: https://github.com/graph-genome/component_segmentation/blob/8c3cbefe6cf6189600c00f2c8769c2392928c0ba/matrixcomponent/PangenomeSchematic.py#L63
This would break the association between chunk number and pangenome position. You can have one or the other, but not both.
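To illustrate the trade-off above, here is a toy sketch, not the actual code in PangenomeSchematic.py. It assumes, purely for illustration, that a chunk is closed once it spans at least `step` bins and that its ID is `first_bin // step`. The function names and the component list are hypothetical.

```python
def chunk_ids_by_first_bin(components, step):
    """Toy model: label each chunk by the position of its first bin.

    components: list of (first_bin, last_bin) tuples, in order.
    step: assumed number of bins a chunk should cover before it is closed.
    """
    ids = []
    chunk_start = None
    for first_bin, last_bin in components:
        if chunk_start is None:
            chunk_start = first_bin
        # Close the chunk once it covers at least `step` bins.
        if last_bin - chunk_start + 1 >= step:
            ids.append(chunk_start // step)
            chunk_start = None
    if chunk_start is not None:
        # Flush the final, partially filled chunk.
        ids.append(chunk_start // step)
    return ids


def chunk_ids_by_counter(components, step):
    """Same chunking, but labelled with a plain counter instead."""
    return list(range(len(chunk_ids_by_first_bin(components, step))))


# Hypothetical components; two of them are much wider than the step size.
components = [(0, 4), (5, 22), (23, 27), (28, 61), (62, 64)]
positional = chunk_ids_by_first_bin(components, step=5)
sequential = chunk_ids_by_counter(components, step=5)
print(positional)
print(sequential)
```

In this model the positional IDs have gaps wherever a component is wider than the step, but each ID still says roughly where the chunk sits in the pangenome. The counter IDs are consecutive and carry no position information.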
I figured it out. I changed the implementation of graph-genome/pipeline to fetch the file names of the start and end chunks from bin2file.json. However, I still see some unexpected behaviour in Schematize, and I suspect it is due to the non-consecutive chunk filenames. I will open another issue on Schematize.
Thank you Toshiyuki. If you find any place where someone has written code like "chunk" + str(chunk_number +1) + ".json" then that should be purged.
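A minimal sketch of what an index lookup could look like in place of building filenames by hand. The layout of bin2file.json is an assumption here: a top-level "files" list whose entries have "file", "first_bin" and "last_bin" keys. Check the real file before relying on this. The helper names and sample filenames are hypothetical.

```python
import json


def load_chunk_index(path):
    """Read the chunk index. ASSUMES a top-level "files" list (not verified)."""
    with open(path) as handle:
        return json.load(handle)["files"]


def files_for_bin_range(entries, start_bin, end_bin):
    """Return the filenames of all chunks overlapping [start_bin, end_bin]."""
    return [
        entry["file"]
        for entry in entries
        if entry["first_bin"] <= end_bin and entry["last_bin"] >= start_bin
    ]


# In-memory stand-in for load_chunk_index("bin2file.json"), hypothetical values.
entries = [
    {"file": "chunk00.json", "first_bin": 1, "last_bin": 5},
    {"file": "chunk02.json", "first_bin": 6, "last_bin": 11},
    {"file": "chunk04.json", "first_bin": 12, "last_bin": 30},
]
wanted = files_for_bin_range(entries, start_bin=4, end_bin=12)
print(wanted)
```

Because the filenames come straight from the index, gaps in the numbering and the zero-padding width never matter to the caller.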
Sure, I'll close this because it does not cause a practical problem.
Hi, I am running on the SARS dataset, SARS-CoV-2.genbank.20200329.complete.odgi.sorted.w1000.json.
I ran:
python3 matrixcomponent/segmentation.py -j SARS-CoV-2.genbank.20200329.complete.odgi.sorted.w1000.json --cells-per-file 5 -o seg
However, the chunk IDs in the output are not serial numbers. The IDs are
00,02,04,12,17,22,28,33,39,45,50
which are not consecutive. Should I change the --cells-per-file parameter? If so, how should I decide its value?