Closed fangbohao closed 1 year ago
Some large chromosomes work well with 'vg autoindex', but small chromosomes do not and fail with the errors shown above.
Can you provide the command line call that you ran into this error on?
Thanks for your reply. Here you go:
vg autoindex --workflow giraffe -g $gfa_chr37 -t 23 --target-mem 90G
(In reply to Jordan Eizenga's comment of Thu, Aug 4, 2022, 4:16 PM: https://github.com/vgteam/vg/issues/3712#issuecomment-1205724194)
By the way, here is the GFA file I used, which is 52MB, a small chromosome.
Please let me know if the GFA file is wrong or not properly produced.
Thank you! Bohao Fang
VGP#prim#SUPER_37.pan.fa.gz.3051141.04f1c29.ecb... https://drive.google.com/file/d/1nLpGPHSlZs4h1hmfuJHcI3hOyIFIDvXY/view?usp=drive_web
@adamnovak This looks to me like it's running into a problem in the named-node stuff you implemented. Could you take a look?
I came across this issue when using panSN-spec named input like ARS_UCD12#hap0#6, but there is a stoll call on the haplotype, so it should just be numeric (i.e. "ARS_UCD12#0#6"). I'm not sure if this was causing the same issue, but I got a very similar crash log.
I couldn't find clear documentation on the pathsense API, but from vg paths -Mv it looks like it expects further groupings than panSN-spec? Is it possible to denote e.g. a primary assembly path vs a haplotype-resolved path, or will everything need the sample ploidy to work?
Best, Alex
Found the path metadata model, https://github.com/vgteam/vg/wiki/Path-Metadata-Model (I knew I had stumbled on it before), so I will try with this a bit further.
Unfortunately I can't get @fangbohao's file; it looks like it's a Google Drive upload shared with a specific list of people that I'm not on.
But it does seem like a path like ARS_UCD12#hap0#6 might be able to cause a crash in __gnu_cxx::__stoa (which is the string-to-number converter) inside path name parsing.
By my reading of the panSN spec that I had when I wrote the path name parsing, that isn't valid panSN because the haplotype piece hap0 is a string; I thought only numbers were allowed there. Maybe that isn't really true?
Whether that's true or not, we should produce a more useful error when we can't parse the path name.
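To illustrate the failure mode and the kind of error message meant here, the following is a minimal Python sketch, not vg's actual C++ parser. The function name parse_pansn is hypothetical, and Python's int() stands in for the role std::stoll plays in C++: both reject a field like hap0.

```python
def parse_pansn(name, delimiter="#"):
    """Split a 'sample#haplotype#contig' name, requiring a numeric haplotype.

    Hypothetical helper for illustration only; vg's real parser is C++ and differs.
    """
    parts = name.split(delimiter, 2)
    if len(parts) != 3:
        raise ValueError(
            f"path name {name!r} does not have three {delimiter!r}-separated fields"
        )
    sample, haplotype, contig = parts
    try:
        # int() rejects 'hap0' here, much as std::stoll would in C++.
        hap = int(haplotype)
    except ValueError:
        # Report which path and which field failed instead of a bare conversion error.
        raise ValueError(
            f"cannot parse path name {name!r}: "
            f"haplotype field {haplotype!r} is not an integer"
        ) from None
    return sample, hap, contig
```

With this sketch, "ARS_UCD12#0#6" is accepted, while "ARS_UCD12#hap0#6" raises a ValueError that names both the path and the offending field.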
OK, @fangbohao shared the file with me, and I tested my fix, and I now have vg interpreting it like this:
[anovak@swords vg]% vg paths --metadata -x ~/Downloads/VGP\#prim\#SUPER_37.pan.fa.gz.3051141.04f1c29.ecbf8cf.smooth.final.gfa
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
MA_2#hap2#h2tg000495l GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE MA_2#hap2#h2tg000495l NO_PHASE_BLOCK NO_SUBRANGE
WA_2#hap1#h1tg000618l GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE WA_2#hap1#h1tg000618l NO_PHASE_BLOCK NO_SUBRANGE
NM_1#hap2#h2tg000401l GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE NM_1#hap2#h2tg000401l NO_PHASE_BLOCK NO_SUBRANGE
AZ_2#hap2#h2tg000020l GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE AZ_2#hap2#h2tg000020l NO_PHASE_BLOCK NO_SUBRANGE
CA_1#hap1#h1tg001701l GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE CA_1#hap1#h1tg001701l NO_PHASE_BLOCK NO_SUBRANGE
CA_1#hap2#h2tg004194l GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE CA_1#hap2#h2tg004194l NO_PHASE_BLOCK NO_SUBRANGE
CA_2#hap2#h2tg002977l GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE CA_2#hap2#h2tg002977l NO_PHASE_BLOCK NO_SUBRANGE
...
It's not parsing it as the file writer intended, I don't think, but it is parsing it to something we can represent. For the file to really work properly (and not result in a possibly unmanageable number of named paths), hap1 and hap2 need to be changed to just 1 and 2. But with #4010 we should at least no longer crash like this.
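The renaming described above could be scripted along these lines. This is only a sketch under assumptions: that the paths are stored as GFA P-lines with the name in the second tab-separated field, and that every affected name follows the sample#hap<digits>#contig pattern seen in the listing. The helper names fix_path_name and fix_gfa_line are made up for illustration; W-lines, if present, are not handled.

```python
import re

# Assumed naming pattern: sample#hap<digits>#contig, e.g. MA_2#hap2#h2tg000495l.
HAP_PREFIX = re.compile(r"^([^#]+)#hap(\d+)#(.+)$")


def fix_path_name(name):
    """Turn 'MA_2#hap2#h2tg000495l' into 'MA_2#2#h2tg000495l'; leave other names alone."""
    return HAP_PREFIX.sub(r"\1#\2#\3", name)


def fix_gfa_line(line):
    """Rewrite the path name (second tab-separated field) of a GFA P-line."""
    if not line.startswith("P\t"):
        return line  # S-, L-, H- and other lines pass through unchanged
    fields = line.split("\t")
    fields[1] = fix_path_name(fields[1])
    return "\t".join(fields)
```

Applying fix_gfa_line to every line of the GFA and writing the result to a new file should give path names with numeric haplotype fields, which can then be passed to vg autoindex.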
1. What were you trying to do? I am trying to index a GFA graph file (a chromosome) produced by PGGB.
2. What did you want to happen? The indexing to complete.
3. What actually happened? The error message shown above appeared.
4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:
5. What data and command can the vg dev team use to make the problem happen?
6. What does running vg version say?