It sounds like the system ran out of memory. How large is the corpus you are
trying to build? How much memory/disk space does your machine have? How large
is your swap space? The tool does support building models beyond the size of
the physical memory. However, you will likely need to increase the OS swap
space. Can you please confirm if this is indeed the issue? Thanks.
Paul
Original comment by bojune...@gmail.com
on 18 Jun 2010 at 4:15
Hi Paul,
thank you for your quick answer.
Well, that sounds reasonable. I am going to check the system's capabilities
as soon as I am near the machine again. As far as I know, it is a Linux x86
machine with 8 GB of RAM; disk space should be sufficient, but I have no
information yet about the swap space.
The corpus contains about 4.5 million normalized sentences, and I am currently
using a 200k word list, so there should be quite a few n-grams to observe. By the way:
I get the same error when I try to build a 3-gram model with the full
9-million-sentence corpus.
Best regards.
Original comment by sebastia...@googlemail.com
on 19 Jun 2010 at 7:29
Out of memory would also explain why the trigram model fails with the full
corpus. Increasing the swap space will enable you to build larger models. The
LM building process will be significantly slower, but it should eventually
finish. I would try increasing the swap space to 64GB and see what happens.
Original comment by bojune...@gmail.com
on 20 Jun 2010 at 12:25
Hi again,
so after increasing the swap space we had the following results:
------------------------------------------------------------------------
* What has been done?
Increased the swap space to 16 GB, so we had 8 GB of RAM and
16 GB of swap space on a Linux x64 quad-core Opteron 2360 SE
* Experiment 1
Using 9.3 million sentences for training and 500k as held out for
parameter tuning to calculate a 3-gram model with 200k words
Peak memory usage: 36.1%
Peak swap usage: 3.3 MB
(no other non-system processes running!)
Result:
- MITLM finished successfully, but the model could not be read by the Julius decoder:
Error: ngram_read_arpa: 3-gram #47080287: "-0.655545 zukü": "zukü" not
exist in 3-gram
Error: init_ngram: failed to read "mitlm.f3g.9.3mio.200k.opt.arpa"
-> at least this indicates that the ARPA model has not been created successfully
(note: when I split the corpus into two parts, sentences 1-4.5mio and 4.5mio
to 9.3mio, it is possible to build both models, but not the combined one - so I
assume there is nothing seriously wrong with the corpus content)
* Experiment 2
Using 4.5 million sentences for training and 500k as held out for
parameter tuning to calculate a 4-gram model with 200k words
Peak memory usage: 29.4 %
Peak swap usage: 3.3 MB
(no other non-system processes running!)
Result: Aborted during parameter optimization with exception
estimate-ngram: src/vector/DenseVector.tcc:406: void
DenseVector<T>::_allocate() [with T = int]: Assertion `_data' failed.
* Experiment 3
Using 9.8 million sentences for training and no held out
to calculate a 3-gram model with 200k words
Peak memory usage: 31.5%
Peak swap usage: 3.7 MB
(no other non-system processes running!)
Result:
- MITLM finished successfully, but the model could not be read by the Julius decoder:
Error: ngram_read_arpa: data format error: end marker "\end" not found
Error: init_ngram: failed to read "mitlm.f3g.9.8mio.200k.noopt.arpa"
-> at least this indicates that the ARPA model has not been created successfully
- evaluate-ngram does not react during arpa file loading
(note: creating arpa model works with HTK speech recognition toolkit, so I
assume there is nothing seriously wrong with the corpus content)
------------------------------------------------------------------------
Any suggestions?
Best regards.
Original comment by sebastia...@googlemail.com
on 21 Jun 2010 at 3:52
For experiment 1, can you please check the content of the ARPA file and find
all occurrences of "zukü"? Specifically, can you find 1, 2, and 3-grams?
For experiment 2, can you repeat with a debug build and use a debugger to
access the stack trace? Without access to the data, it will be tricky to
identify the cause.
For experiment 3, can you verify that the ARPA file actually does not contain
an "\end" marker at the end of the file?
Finally, can you please let me know the sizes of all the input and output files
so I can get a rough estimate of the scale of the data? Thanks.
Paul
Original comment by bojune...@gmail.com
on 21 Jun 2010 at 4:26
Hi Paul,
so here is the requested information...
-----------------------------------------------------------------
"For experiment 1, can you please check the content of the ARPA
file and find all occurrences of "zukü"? Specifically, can you
find 1, 2, and 3-grams?"
-----------------------------------------------------------------
Since the ARPA file sorts the n-grams alphabetically, I can observe that
the file just stops during the 3-gram listing.
The last entry is "-0.655545 zukü", and then the line ends abruptly.
"zukü" is also not a legal word in the language under test.
See below for further observations.
-----------------------------------------------------------------
For experiment 3, can you verify that the ARPA file actually does
not contain an "\end" marker at the end of the file?
-----------------------------------------------------------------
Since the ARPA file sorts the n-grams alphabetically, I can observe that
the file just stops at the 3-grams starting with "v" in the first position.
So, no, there is no end tag. In this case the last trigram in the
list happened to contain only legal words (the last word before the line
suddenly ended is a legal word, so there is a unigram for it), so
it is reasonable that in this case Julius only had a
problem with the missing end tag.
Interestingly, the ARPA files with the missing ends have exactly the same
byte size of 2147483647 bytes (see below), which is the maximum
value for a 32-bit signed integer.
-----------------------------------------------------------------
For experiment 2, can you repeat with a debug build and use a
debugger to access the stack trace? Without access to the data,
it will be tricky to identify the cause.
-----------------------------------------------------------------
I used gdb/ddd to debug the process (though I am not really familiar
with this procedure). Here's the output:
me@computer:~> gdb estimate-ngram
GNU gdb (GDB) SUSE (6.8.91.20090930-2.4)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/me/bin/mitlm/estimate-ngram...done.
(gdb) run -order 4 -v
/home/me/blob/lm/lm-tool-comparison/mitlm/mitlm.f4g.4.5mio.200k.opt/wlist -unk
-t /home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.txt -opt-perp
/home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.backoff.txt -wl
mitlm.f4g.4.5mio.200k.opt.arpa -wc mitlm.f4g.4.5mio.200k.opt.arpa.counts
Starting program: /home/me/bin/mitlm/estimate-ngram -order 4 -v
/home/me/blob/lm/lm-tool-comparison/mitlm/mitlm.f4g.4.5mio.200k.opt/wlist -unk
-t /home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.txt -opt-perp
/home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.backoff.txt -wl
mitlm.f4g.4.5mio.200k.opt.arpa -wc mitlm.f4g.4.5mio.200k.opt.arpa.counts
0.000 Loading vocab
/home/me/blob/lm/lm-tool-comparison/mitlm/mitlm.f4g.4.5mio.200k.opt/wlist...
0.170 Loading corpus
/home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.txt...
717.120 Smoothing[1] = ModKN
717.120 Smoothing[2] = ModKN
717.120 Smoothing[3] = ModKN
717.120 Smoothing[4] = ModKN
717.120 Set smoothing algorithms...
731.510 Loading development set
/home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.backoff.txt...
estimate-ngram: src/vector/DenseVector.tcc:406: void
DenseVector<T>::_allocate() [with T = int]: Assertion `_data' failed.
Program received signal SIGABRT, Aborted.
0xffffe425 in __kernel_vsyscall ()
-----------------------------------------------------------------
Finally, can you please let me know the sizes of all the input
and output files so I can get a rough estimate of the scale of the
data?
-----------------------------------------------------------------
Full corpus 9.8 mio. sentences: 1.155 GB
Half corpus 4.5 mio. sentences: 0.5546 GB
All smaller sets (e.g., 9.3 mio. training + 0.5 mio. heldout) are subsets of
the full corpus
Lexicon: 195,121 words
Exp. 1:
ARPA file: 2147483647 bytes (2.14 GB)
Count file: 1.542 GB
Exp. 2:
Since the experiment crashed, there is no information here.
Exp. 3:
ARPA file: 2147483647 bytes (2.14 GB)
Count file: 1.606 GB
----
Best regards.
Original comment by sebastia...@googlemail.com
on 22 Jun 2010 at 9:45
It sounds like you are encountering some sort of 2 GB file limit. MITLM itself
appears to be fine until it attempts to write the LM to file.
Can you please verify that you can generate larger files in the same directory
by concatenating two large files?
cat file1 file2 > file3
Can you also try running this on a different machine? To isolate the issue,
try to run everything on local disk instead of over the network. If this still
doesn't work, can you please describe your system configuration? I currently
suspect most of these issues result from some file system limit.
Original comment by bojune...@gmail.com
on 22 Jun 2010 at 4:36
Hi again,
--------------------------------------------------------
Can you please verify that you can generate larger files
in the same directory concatenating two large files?
--------------------------------------------------------
Yes. It is possible.
--------------------------------------------------------
Can you also try running this on a different machine?
--------------------------------------------------------
I tried Exp. 1 and Exp. 3 on my local machine (not the Opteron) and got the
assertions (see #4). So there is definitely an effect when switching
to a machine with lower performance, in conjunction with this
assertion-failed error. The local system under test was an
IBM ThinkPad with a 1.5 GHz single core and 512 MB of RAM. So I
guess the problem with the 4-grams of experiment 2 (#4) is related
to a performance problem. Maybe I should try it with some cutoff for the
3-gram and/or 4-gram counts. By the way: MITLM does not support such
a cutoff feature, right?
--------------------------------------------------------
To isolate the issue, try to run everything on local disk
instead of over the network.
--------------------------------------------------------
I did that for Exp. 1 and 2. I had exactly the same problem as before.
Creating the model seems to work fine, but there is some error
when writing the model to disk.
The ARPA file again contains only the first 2147483647 bytes.
On this particular local disk I successfully repeated the "cat" concatenation
of two 2.14 GB files, so there is no disk limitation here. The machine I used
is described below.
--------------------------------------------------------
If this still doesn't work, can you please describe your
system configuration? I currently suspect most of these
issues result from some file system limit.
--------------------------------------------------------
Here it is:
meminfo:
MemTotal: 7934804 kB
MemFree: 1518820 kB
Buffers: 748 kB
Cached: 3011228 kB
SwapCached: 2104 kB
Active: 5237372 kB
Inactive: 1008168 kB
Active(anon): 2752328 kB
Inactive(anon): 481252 kB
Active(file): 2485044 kB
Inactive(file): 526916 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 16771820 kB
SwapFree: 16706040 kB
Dirty: 16 kB
Writeback: 0 kB
AnonPages: 3232788 kB
Mapped: 16528 kB
Slab: 113032 kB
SReclaimable: 92724 kB
SUnreclaim: 20308 kB
PageTables: 13408 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 20739220 kB
Committed_AS: 4195476 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 34744 kB
VmallocChunk: 34359698355 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 10240 kB
DirectMap2M: 8116224 kB
cpuinfo (identically for all 4 cores):
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : Quad-Core AMD Opteron(tm) Processor 2360 SE
stepping : 3
cpu MHz : 2500.253
cache size : 512 KB
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm
3dnowext 3dnow constant_tsc rep_good tsc_reliable nonstop_tsc extd_apicid pni
cx16 popcnt lahf_lm extapic abm sse4a misalignsse 3dnowprefetch
bogomips : 5000.20
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
local disk in use:
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sdb1 103210940 8668968 89299164 9% /data
other info:
Linux computer 2.6.31.12-0.2-desktop #1 SMP PREEMPT 2010-03-16 21:25:39 +0100
x86_64 x86_64 x86_64 GNU/Linux
Linux version 2.6.31.12-0.2-desktop (geeko@buildhost) (gcc version 4.4.1
[gcc-4_4-branch revision 150839] (SUSE Linux) ) #1 SMP PREEMPT 2010-03-16
21:25:39 +0100
openSUSE 11.2 (x86_64)
VERSION = 11.2
--------------------------------------------------------
Thanks for your help!
Best regards.
Original comment by sebastia...@googlemail.com
on 23 Jun 2010 at 12:30
Running out of ideas. A few things you can try:
- Write to .arpa.gz instead of .arpa. MITLM will automatically compress the
file using gzip. My hypothesis is that we will still hit the problem at 2GB
if the compressed file exceeds that size.
- Verify that we are indeed encountering an OS file size issue by creating a
simple C++ executable that writes a large file using the same technique as
MITLM. See src/util/ZFile.h.
No, MITLM does not currently support count cutoffs. It should work if you
manually remove n-grams from the count file, although I have not tried
it before. You will likely have to tune the discount parameters in this case,
as the estimation from count-of-count statistics will no longer apply.
Paul
Original comment by bojune...@gmail.com
on 23 Jun 2010 at 4:30
Hi Paul,
just to inform you: I am going to further investigate this issue as soon as I
am done with some other things. Unfortunately there were some urgent issues
here and there - as always.
Thanks for the help so far.
Original comment by sebastia...@googlemail.com
on 5 Jul 2010 at 2:17
Hi Paul,
I tested possibility 1, which enabled me to build an ARPA file (3-gram, 9.7mio
sentences, 0.1mio held out) beyond the size of 2147483647 bytes. The final ARPA
file (after successful unpacking) had a size of 2167945068 bytes, whereas the gz
file was below the 2.14 GB threshold. When I try to create a model with
the same parameters by writing the results directly to the ARPA file, I have the
same issues as before - the ARPA file is missing all information beyond
byte #2147483647.
Finally I tried to create the "big" 3-gram model using 9.8 million sentences
without held-out data (this always crashed when writing directly to ARPA), writing
the result to a gz file. The gz file had a size of 686332293 bytes and
2345369756 bytes after unpacking.
So at least that solves the problem of creating big models, without really
identifying the reason for the problem. I will go into the source code. Maybe I
can find a hint there.
Best regards.
Original comment by sebastia...@googlemail.com
on 9 Jul 2010 at 11:29
Sounds like you have temporarily avoided the 2GB limit issue for now, since both
.gz files are below 2GB.
I am still troubled by the 2GB issue. Can you please verify that you are
building MITLM for 64-bit?
Original comment by bojune...@gmail.com
on 9 Jul 2010 at 4:40
Well, I am not sure whether I missed something then. I used the x86_64 machine
described above, doing the following:
- Install gcc-fortran (http://code.google.com/p/mitlm/issues/detail?id=14)
- Got patch
http://mitlm.googlecode.com/issues/attachment?aid=-4077538411916343304&name=mitl
m.0.4.ubuntu9.10.patch&token=be5f533c528fedb9ab1bd9065f4666d7
- Applied patch: patch -p1 -i mitlm.0.4.ubuntu9.10.patch
- make clean
- make DEBUG=1
Best regards.
Original comment by sebastia...@googlemail.com
on 11 Jul 2010 at 2:47
Hi, I have a problem smoothing large 3-gram count files.
MITLM saves 1.5 GB of the file and then gives me an error. I use the command
estimate-ngram -order 3 -counts allgrams -smoothing FixModKN -wl
allgrams.FixModKN.lm and get this error:
Saving LM to train.corpus.lm...
estimate-ngram: src/NgramModel.cpp:422: void NgramModel::SaveLM(const
std::vector<DenseVector<double>, std::allocator<DenseVector<double> > >&, const
std::vector<DenseVector<double>, std::allocator<DenseVector<double> > >&,
ZFile&) const: Assertion `(size_t)(ptr - lineBuffer.data()) <
lineBuffer.size()' failed.
Before, I tried 2-grams with a 4.7 GB file and it worked fine. The 3-gram file
is 20 GB.
My operating system is GNU/Linux x86_64 with 96 GB RAM.
Original comment by Roksana....@gmail.com
on 19 Nov 2012 at 10:24
Original issue reported on code.google.com by
sebastia...@googlemail.com
on 18 Jun 2010 at 8:47