graph-genome / component_segmentation

Read in ODGI Bin output and identify co-linear components
Apache License 2.0

IndexError: list index out of range #5

Closed (6br closed this issue 4 years ago)

6br commented 4 years ago

I ran component_segmentation at the latest master HEAD on data3.gfa (located in /home/ubuntu/ty/test-pipeline/) and got this error:

$ tail -n 20 data3.seg.log
Segmenting 934 100.0%
Segmenting 248918 100.0%
Largest bin_id was 3027286
Found 4759 dividers.
Eliminated 0 self-loops
{(1079092, 1265532), (393836, 494724), (410335, 1676518), (1906142, 1906161), (618645, 865824), (690656, 726629), (636617, 1277533), (610068, 798082), (299767, 494806), (1970869, 1970947), (2158577, 2159556), (656673, 656688), (653903, 690499), (896138, 1525586), (653569, 1427747), (586409, 1584587), (652877, 819897)}
Created 4759 components
Populated Matrix per component per path.
Populated Occupancy per component per path.
Created 8592 LinkColumns
Traceback (most recent call last):
  File "matrixcomponent/segmentation.py", line 275, in <module>
    main()
  File "matrixcomponent/segmentation.py", line 271, in main
    write_json_files(args.json_file, schematic)
  File "matrixcomponent/segmentation.py", line 207, in write_json_files
    partitions, bin2file_mapping = schematic.split(args.cells_per_file)
  File "/usr/src/app/component_segmentation/matrixcomponent/PangenomeSchematic.py", line 70, in split
    these_comp[0].first_bin,
IndexError: list index out of range

subwaystation commented 4 years ago

I ran into the same issue yesterday. I suspect you did not use the current odgi master to build the graph? After I switched to it, everything went fine. @josiahseaman replaced the median nucleotide position with the actual start and end positions, and I suspect you used a version that still outputs the median nucleotide position. Please report back.

6br commented 4 years ago

Thank you, @subwaystation. I was using https://github.com/graph-genome/odgi.git in graph-genome/pipeline. I have now switched to https://github.com/vgteam/odgi.git. Another question: what does bin-size mean in component_segmentation? In other words, what is a good value for bin-size? So far I have set it to the same value as -bin-width (in Schematize).

subwaystation commented 4 years ago

The idea is that cs (component_segmentation) passes the bin width on to Schematize in its output file, so the user has to supply it as an argument. Yes, it is the same as --bin-width in Schematize and odgi. In the future, we will read this value in rather than ask for it. Maybe we can have a short chat today, so I can update you on our plans ;)

6br commented 4 years ago

I was confused because the option name differs between odgi (bin-width) and cs (bin-size). But now I completely understand. Thank you!

6br commented 4 years ago

I tried it with vgteam/odgi at master HEAD, but I still get the same error.

subwaystation commented 4 years ago

Currently taking a look.

josiahseaman commented 4 years ago

@6br component_segmentation is made to work with the latest odgi, which outputs first_nucleotide and last_nucleotide as you requested. Here's our fork: https://github.com/graph-genome/odgi

It was merged to vgteam master: https://github.com/vgteam/odgi/pull/79

So if your code is older than Feb 20 you might not have gotten it. https://github.com/vgteam/odgi/commits/master

subwaystation commented 4 years ago

@6br was using our latest pulls (odgi, component_segmentation) on pg2. So it should have worked. I can reproduce the error. Please take a look at /home/ubuntu/sh/test_data3_ty and the scripts in there to reproduce. If I understand it right, this error occurs during the calculation of the links. Interestingly, this problem was not present when munching the phage data.

josiahseaman commented 4 years ago

Actually, this is my fault. There's an edge case that only comes up in TY's data.

    these_comp = self.components[cut:end_cut]
    if these_comp:  # when we're out of data because the last component is 1 wide
        partitions.append(
            PangenomeSchematic(JSON_VERSION,
                               self.bin_size,
                               these_comp[0].first_bin,
                               these_comp[-1].last_bin,
                               these_comp, self.path_names, self.total_nr_files))
        bin2file_mapping.append({"first_bin": these_comp[0].first_bin, "file": self.filename(i)})
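
For anyone hitting the same traceback, here is a minimal, self-contained sketch of the failure mode. It is not the real PangenomeSchematic.split: the Component class and the split_ranges function are hypothetical stand-ins, and the loop deliberately runs one chunk too far to imitate a cut that lands past the end of the data. Slicing a Python list past its end returns an empty list instead of raising, so the IndexError only appears when the code indexes [0] on that empty slice. The `if chunk:` guard plays the same role as `if these_comp:` above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Component:
    """Hypothetical stand-in for a component; only the bin range matters here."""
    first_bin: int
    last_bin: int


def split_ranges(components: List[Component], per_file: int) -> List[Tuple[int, int]]:
    """Return (first_bin, last_bin) for each chunk of `per_file` components."""
    ranges = []
    # Deliberately iterate one chunk too far, to mimic a cut past the end of the data.
    n_chunks = len(components) // per_file + 1
    for i in range(n_chunks):
        chunk = components[i * per_file:(i + 1) * per_file]
        # An out-of-range slice is just [], so chunk[0] would raise IndexError.
        if chunk:
            ranges.append((chunk[0].first_bin, chunk[-1].last_bin))
    return ranges


comps = [Component(1, 3), Component(4, 4), Component(5, 9), Component(10, 10)]
print(split_ranges(comps, 2))
```

Without the guard, the third iteration in this example would slice comps[4:6], get an empty list, and fail on chunk[0] with "list index out of range", the same message as in the log.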