marschall-lab / gaftools

General purpose utility related to GAF files
https://gaftools.readthedocs.io/
MIT License
11 stars 0 forks source link

order_gfa with minigraph-cactus #18

Closed asylvz closed 10 months ago

asylvz commented 10 months ago

I'm getting the error below when with minigraph cactus: gaftools order_gfa /gpfs/project/projects/medbioinf/users/asoylev/hgsvc3/ref_graph/hprc-v1.1-mc-chm13.gfa

Reading /gpfs/project/projects/medbioinf/users/asoylev/hgsvc3/ref_graph/hprc-v1.1-mc-chm13.gfa
The graph has:
The graph has 92879580 Nodes and 128165765 Edges
Connected components: 25
Traceback (most recent call last):
  File "/home/liy34nag/bin/gaftools", line 33, in <module>
    sys.exit(load_entry_point('gaftools', 'console_scripts', 'gaftools')())
  File "/gpfs/project/projects/medbioinf/users/asoylev/gaftools/gaftools/__main__.py", line 88, in main
    module.main(args)
  File "/gpfs/project/projects/medbioinf/users/asoylev/gaftools/gaftools/cli/order_gfa.py", line 218, in main
    run_order_gfa(**vars(args))
  File "/gpfs/project/projects/medbioinf/users/asoylev/gaftools/gaftools/cli/order_gfa.py", line 39, in run_order_gfa
    components = name_comps(graph, components)
  File "/gpfs/project/projects/medbioinf/users/asoylev/gaftools/gaftools/cli/order_gfa.py", line 195, in name_comps
    counts = Counter([graph[n].tags["SN"][1] for n in comp])
  File "/gpfs/project/projects/medbioinf/users/asoylev/gaftools/gaftools/cli/order_gfa.py", line 195, in <listcomp>
    counts = Counter([graph[n].tags["SN"][1] for n in comp])
KeyError: 'SN'
fawaz-dabbaghieh commented 10 months ago

Thanks for pointing this out, I fixed the bug, it was there because my assumption was that all the nodes would have an SN tag. So now I just check if they do and count the SN tag values for the one that do have it.

Moreover, as it was originally written by Tobias, if the SN tags don't reflect the hard-coded names (chr1, chr2, chr3, etc...), then you need to provide these names, otherwise, when you run it, you'll wait a lot of time before getting this message

gaftools order_gfa /gpfs/project/projects/medbioinf/users/asoylev/hgsvc3/ref_graph/hprc-v1.1-mc-chm13.gfa --outdir output/
Reading /gpfs/project/projects/medbioinf/users/asoylev/hgsvc3/ref_graph/hprc-v1.1-mc-chm13.gfa
The graph has:
The graph has 92879580 Nodes and 128165765 Edges
Connected components: 25
Chromosome set mismatch:
  Expected: chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY,chrM
  Found: CHM13#chr1,CHM13#chr10,CHM13#chr11,CHM13#chr12,CHM13#chr13,CHM13#chr14,CHM13#chr15,CHM13#chr16,CHM13#chr17,CHM13#chr18,CHM13#chr19,CHM13#chr2,CHM13#chr20,CHM13#chr21,CHM13#chr22,CHM13#chr3,CHM13#chr4,CHM13#chr5,CHM13#chr6,CHM13#chr7,CHM13#chr8,CHM13#chr9,CHM13#chrM,CHM13#chrX,CHM13#chrY

So you need to provide the correct chromosome names as present in the SN tags of that graph. Maybe we need to find a better way to do this later.

asylvz commented 10 months ago

Can't you do a fast skip through the file to get the names. The "Found" line in your error message is what you actually need, right?

fawaz-dabbaghieh commented 10 months ago

I don't think I follow what you mean by fast skip here. I fixed the error you got because I was checking for the SN tag and some nodes didn't have any. Now I check if there's an SN tag, and if there's I count it for the component to name the component

fawaz-dabbaghieh commented 10 months ago

I just made another push, maybe I understood what you meant. But now, if the user doesn't provide anything for --chromosome_order, the program will run and just output whatever components are found and use the majority-vote SN tag to name that component. If the user provides only a subset of the chromosomes (e.g. chr1,chr22) for --chromosome_order argument, then only those chromosomes will be outputted. If the user provides a chromosome name that doesn't exist in the SN tags, or at least wasn't a majority vote, then the program will abort.

I will close this bug, as I ran it now and all seems to work :)