Closed by GoogleCodeExporter 9 years ago
Thanks for your report.
I've found that test150.txt is not correctly sorted.
Specifically, line 65 sorts after line 66 in byte order, because 'o' (0x6F) is greater than '0' (0x30).
65: +.n|+.x|a.x|c.n;b.nAgAmACoA
66: +.n|+.x|a.x|c.n;b.nAgAmAC0A
I've checked this as follows:
$ LC_ALL=C sort test150.txt > sorted.txt
$ diff test150.txt sorted.txt
65,66d64
< +.n|+.x|a.x|c.n;b.nAgAmACoA
< +.n|+.x|a.x|c.n;b.nAgAmAC0A
67a66,67
> +.n|+.x|a.x|c.n;b.nAgAmAC0A
> +.n|+.x|a.x|c.n;b.nAgAmACoA
Would you test again after this correction?
Original comment by susumu.y...@gmail.com
on 12 Oct 2012 at 1:05
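For input too large to sort and diff comfortably, the same check can be done in a single pass. The following is a minimal Python sketch, assuming that a bytewise comparison of lines matches what `LC_ALL=C sort` produces; the function name `first_unsorted_line` is made up for illustration and is not part of dawgdic.

```python
def first_unsorted_line(path):
    """Return (line_number, previous, current) for the first line that sorts
    before its predecessor in byte order, or None if the file is sorted."""
    previous = None
    # Read as bytes so that comparisons follow raw byte values,
    # which is the ordering assumed here for LC_ALL=C sort.
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            current = line.rstrip(b"\r\n")  # compare keys without line endings
            if previous is not None and current < previous:
                return number, previous, current
            previous = current
    return None
```

Called on a file containing the two lines quoted above in their original order, this should report the second of them as out of place, since `AC0A` sorts before `ACoA`. Equal adjacent lines are treated as sorted.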
Yes, sorry, it does seem that the file is out of order. Thanks.
However, with my full data set (a 2.3GB text file) I now get the following
error:
bash-3.2$ dawgdic-build -g testBundleSubs2sorted.txt testBundleSubs2.dawg
no. keys: 46112348
no. states: 307452395
no. transitions: 348225617
no. merged states: 1073607010
no. merging states: 39222660
no. merged transitions: 1032833787
error: failed to build Dictionary
I think this is in fact the problem I was originally tracing back from the
Python wrapper.
The data set is quite large. Does dawgdic have any size limits that might be
the cause?
Unfortunately I can't share the full data file.
Thanks
Original comment by peter.mc...@gmail.com
on 13 Oct 2012 at 3:29
The number of units in a Dictionary is limited to 2^29 (around 536M).
I'm not sure how many keys you can put in a Dictionary, but the number of units
must be greater than the number of transitions.
In this case, you may be able to put half of the data set in one Dictionary.
Original comment by susumu.y...@gmail.com
on 14 Oct 2012 at 11:42
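To put the numbers from the failing build next to that limit, here is a small Python sketch. It only does the arithmetic: the thread does not say how many units a given number of transitions requires, so `unit_headroom` (a made-up helper name) gives an optimistic upper bound on the remaining room, not a prediction of whether a build succeeds.

```python
MAX_UNITS = 2 ** 29  # unit limit stated in the comment above

def unit_headroom(transitions, max_units=MAX_UNITS):
    """Units left over if every transition needed exactly one unit.

    The Dictionary needs more units than transitions, so the real
    headroom is smaller than this value (assumption: by an unknown amount).
    """
    return max_units - transitions

transitions = 348_225_617  # "no. transitions" reported by dawgdic-build above

print(MAX_UNITS)                   # 536870912
print(unit_headroom(transitions))  # 188645295
```

The transition count alone is roughly 65% of the limit, so a build that needs noticeably more units than transitions could plausibly exceed 2^29, which is consistent with the error reported above.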
Ok, I suspected I was hitting some sort of size limit.
Rather than split the data over multiple dictionaries I've switched over to
using marisa-trie (I haven't yet hit any upper limits there).
Thanks for the help, and for the library, which will almost certainly be useful
in the future.
Original comment by peter.mc...@gmail.com
on 14 Oct 2012 at 2:24
That's a good idea.
I've heard that 1 billion keys (probably several times larger than your data set)
can be stored in one marisa-trie instance.
But please note that marisa-trie is much slower than dawgdic.
If you need a fast data structure, it might be a bad choice.
Good luck.
Original comment by susumu.y...@gmail.com
on 15 Oct 2012 at 9:02
Original issue reported on code.google.com by
peter.mc...@gmail.com
on 11 Oct 2012 at 6:06