kimiamania / mitlm

estimate-ngram crashes on 4-gram modeling #19

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Hello,

I tried to create a 4-gram language model with your estimate-ngram 
tool, which led to the following debug output:

0.000   Loading vocab wlist...
0.170   Loading corpus corpus.txt...
estimate-ngram: src/vector/DenseVector.tcc:406: void 
DenseVector<T>::_allocate() [with T = int]: Assertion `_data' failed.

I used the command:
estimate-ngram -order 4 -v wlist -unk -t corpus.txt -wl arpa

When I try to create a trigram model from the same corpus, the tool runs 
like a charm. 

Original issue reported on code.google.com by sebastia...@googlemail.com on 18 Jun 2010 at 8:47

GoogleCodeExporter commented 9 years ago
It sounds like the system ran out of memory.  How large is the corpus you are 
trying to build?  How much memory/disk space does your machine have?  How large 
is your swap space?  The tool does support building models beyond the size of 
the physical memory.  However, you will likely need to increase the OS swap 
space.  Can you please confirm if this is indeed the issue?  Thanks.
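
The assertion message is consistent with an allocation that returned a null 
pointer. A minimal sketch of that failure pattern, assuming `_allocate()` 
requests memory with `malloc` and asserts on the result (the actual MITLM code 
may differ; `try_allocate` and `allocate_or_abort` are hypothetical helpers, 
not part of MITLM):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not MITLM code): reports whether a buffer of `count`
// ints can be allocated. std::malloc returns nullptr on failure.
// `count` is assumed to be greater than zero.
bool try_allocate(std::size_t count) {
    if (count > SIZE_MAX / sizeof(int)) return false;  // byte size would overflow
    void* data = std::malloc(count * sizeof(int));
    if (data == nullptr) return false;
    std::free(data);
    return true;
}

// Same request, but asserting on the result, as the reported message suggests.
// With assertions enabled, a failed allocation aborts the process here.
int* allocate_or_abort(std::size_t count) {
    int* data = static_cast<int*>(std::malloc(count * sizeof(int)));
    assert(data != nullptr);
    return data;
}
```

Probing with `try_allocate` before a large request gives a recoverable error 
path, whereas the asserting variant ends in SIGABRT as in the report.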

Paul 

Original comment by bojune...@gmail.com on 18 Jun 2010 at 4:15

GoogleCodeExporter commented 9 years ago
Hi Paul,
thank you for your quick answer.

Well, that sounds reasonable. I am going to check the system's capabilities 
as soon as I am near the machine again. As far as I know, it is a Linux x86 
machine with 8GB of RAM; disk space should be sufficient, but I have no 
information yet about the swap space. 

The corpus contains about 4.5 million normalized sentences and I am currently 
using a 200k wordlist. This should be a couple of grams to observe. By the way: 
I get the same error when I try to build a 3-gram model with the full corpus of 
9 million sentences.

Best regards.

Original comment by sebastia...@googlemail.com on 19 Jun 2010 at 7:29

GoogleCodeExporter commented 9 years ago
Out of memory would also explain why the trigram model fails with the full 
corpus.  Increasing the swap space will enable you to build larger models.  The 
LM building process will be significantly slower, but it should eventually 
finish.  I would try increasing the swap space to 64GB and see what happens.

Original comment by bojune...@gmail.com on 20 Jun 2010 at 12:25

GoogleCodeExporter commented 9 years ago
Hi again,
so after increasing the swap space we had the following results:

------------------------------------------------------------------------

* What has been done?
Increased the swap space to 16 GB, so we had 8 GB of RAM and 
16 GB of swap space on a Linux x64 quad-core Opteron 2360 SE

* Experiment 1
Using 9.3 million sentences for training and 500k as held out for
parameter tuning to calculate a 3-gram model with 200k words

Peak memory usage: 36.1%
Peak swap usage: 3.3 MB
(no other non-system processes running!)

Result: 
- mitlm successful but model could not be read by julius decoder with error
Error: ngram_read_arpa: 3-gram #47080287: "-0.655545    zukü": "zukü" not 
exist in 3-gram
Error: init_ngram: failed to read "mitlm.f3g.9.3mio.200k.opt.arpa"  
-> at least this indicates that the arpa model has not been created successfully
(note: when I split the corpus into two parts, sentences 1-4.5mio and 4.5mio to 
9.3mio, it is possible to build both models, but not the combined one - so I 
assume there is nothing seriously wrong with the corpus content)

* Experiment 2
Using 4.5 million sentences for training and 500k as held out for
parameter tuning to calculate a 4-gram model with 200k words

Peak memory usage: 29.4 %
Peak swap usage: 3.3 MB
(no other non-system processes running!)

Result: Aborted during parameter optimization with exception
estimate-ngram: src/vector/DenseVector.tcc:406: void 
DenseVector<T>::_allocate() [with T = int]: Assertion `_data' failed.

* Experiment 3
Using 9.8 million sentences for training and no held out 
to calculate a 3-gram model with 200k words

Peak memory usage: 31.5%
Peak swap usage: 3.7 MB
(no other non-system processes running!)

Result:
- mitlm successful but model could not be read by julius decoder with error
Error: ngram_read_arpa: data format error: end marker "\end" not found
Error: init_ngram: failed to read "mitlm.f3g.9.8mio.200k.noopt.arpa"
-> at least this indicates that the arpa model has not been created successfully
- evaluate-ngram does not react during arpa file loading
(note: creating arpa model works with HTK speech recognition toolkit, so I 
assume there is nothing seriously wrong with the corpus content)

------------------------------------------------------------------------

Any suggestions?

Best regards.

Original comment by sebastia...@googlemail.com on 21 Jun 2010 at 3:52

GoogleCodeExporter commented 9 years ago
For experiment 1, can you please check the content of the ARPA file and find 
all occurrences of "zukü"?  Specifically, can you find 1, 2, and 3-grams?

For experiment 2, can you repeat with a debug build and use a debugger to 
access the stack trace?  Without access to the data, it will be tricky to 
identify the cause.

For experiment 3, can you verify that the ARPA file actually does not contain 
an "\end" marker at the end of the file?

Finally, can you please let me know the sizes of all the input and output files 
so I can get a rough estimate of the scale of the data?  Thanks.

Paul

Original comment by bojune...@gmail.com on 21 Jun 2010 at 4:26

GoogleCodeExporter commented 9 years ago
Hi Paul,

so here is the requested information...

-----------------------------------------------------------------
"For experiment 1, can you please check the content of the ARPA 
file and find all occurrences of "zukü"?  Specifically, can you 
find 1, 2, and 3-grams?"
-----------------------------------------------------------------

Since the arpa sorts the grams alphabetically, I can observe that
the arpa file just stops during the 3-grams listing.
Last entry is "-0.655545       zukü" and then the line ends abruptly. 
"zukü" is also not a legal word in the language under test.
See below for further observations.

-----------------------------------------------------------------
For experiment 3, can you verify that the ARPA file actually does 
not contain an "\end" marker at the end of the file?
-----------------------------------------------------------------

Since the arpa sorts the grams alphabetically, I can observe that
the arpa file just stops at the 3-grams starting with "v" for n-2.
So, no, there is no end tag. In this case the last trigram in the
list happened to contain only legal words (the last token before the 
line suddenly ended is a legal letter, so there is a unigram for it), 
so it is reasonable that julius only had a problem with the missing 
end tag.

Interestingly, the arpa files with the missing ends have exactly the 
same byte size of 2147483647 (see below), which is the maximum 
value for a 32-bit signed integer.
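
This can be checked directly. A small sketch (`kMaxSigned32` and 
`fits_in_signed_32` are hypothetical names used only for illustration) 
showing that 2147483647 equals 2^31 - 1 and that one more byte no longer 
fits in a 32-bit signed offset:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest value a 32-bit signed integer can hold; a file offset stored in
// such a type cannot point past this byte position.
constexpr std::int64_t kMaxSigned32 = std::numeric_limits<std::int32_t>::max();

// Hypothetical helper: true if a file of `bytes` bytes can be addressed
// with a 32-bit signed offset.
constexpr bool fits_in_signed_32(std::int64_t bytes) {
    return bytes >= 0 && bytes <= kMaxSigned32;
}

static_assert(kMaxSigned32 == 2147483647LL, "2^31 - 1");
```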

-----------------------------------------------------------------
For experiment 2, can you repeat with a debug build and use a 
debugger to access the stack trace?  Without access to the data, 
it will be tricky to identify the cause.
-----------------------------------------------------------------

I used gdb/ddd to debug the process (though I am not really familiar
with this procedure). Here's the output:

me@computer:~> gdb estimate-ngram
GNU gdb (GDB) SUSE (6.8.91.20090930-2.4)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/me/bin/mitlm/estimate-ngram...done.
(gdb) run -order 4 -v 
/home/me/blob/lm/lm-tool-comparison/mitlm/mitlm.f4g.4.5mio.200k.opt/wlist -unk 
-t /home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.txt -opt-perp 
/home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.backoff.txt -wl 
mitlm.f4g.4.5mio.200k.opt.arpa -wc mitlm.f4g.4.5mio.200k.opt.arpa.counts
Starting program: /home/me/bin/mitlm/estimate-ngram -order 4 -v 
/home/me/blob/lm/lm-tool-comparison/mitlm/mitlm.f4g.4.5mio.200k.opt/wlist -unk 
-t /home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.txt -opt-perp 
/home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.backoff.txt -wl 
mitlm.f4g.4.5mio.200k.opt.arpa -wc mitlm.f4g.4.5mio.200k.opt.arpa.counts
0.000   Loading vocab 
/home/me/blob/lm/lm-tool-comparison/mitlm/mitlm.f4g.4.5mio.200k.opt/wlist...
0.170   Loading corpus 
/home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.txt...
717.120 Smoothing[1] = ModKN
717.120 Smoothing[2] = ModKN
717.120 Smoothing[3] = ModKN
717.120 Smoothing[4] = ModKN
717.120 Set smoothing algorithms...
731.510 Loading development set 
/home/me/blob/lm/corpus-normalized/dpa_plvm_ac623.bin1-2.backoff.txt...
estimate-ngram: src/vector/DenseVector.tcc:406: void 
DenseVector<T>::_allocate() [with T = int]: Assertion `_data' failed.

Program received signal SIGABRT, Aborted.
0xffffe425 in __kernel_vsyscall ()

-----------------------------------------------------------------
Finally, can you please let me know the sizes of all the input 
and output files so I can get a rough estimate of the scale of the 
data? 
-----------------------------------------------------------------

Full corpus 9.8 mio. sentences: 1.155 GB
Half corpus 4.5 mio. sentences: 0.5546 GB
All smaller sets (e.g., 9.3 mio. training + 0.5 mio. heldout) are subsets of 
the full corpus

Lexicon: 195,121 words

Exp.1: 
Arpa file: 2147483647 Byte (2.14 GB)
Count file: 1.542 GB

Exp.2:
Since the experiment crashes, there is no information here.

Exp.3:
Arpa file: 2147483647 Byte (2.14 GB)
Count file: 1.606 GB

----
Best regards.

Original comment by sebastia...@googlemail.com on 22 Jun 2010 at 9:45

GoogleCodeExporter commented 9 years ago
It sounds like you are encountering some sort of 2GB file limit.  MITLM itself 
appears to be fine until it attempts to write the LM to file.

Can you please verify that you can generate larger files in the same directory 
by concatenating two large files?
  cat file1 file2 > file3

Can you also try running this on a different machine?  To isolate the issue, 
try to run everything on local disk instead of over the network.  If this still 
doesn't work, can you please describe your system configuration?  I currently 
suspect most of these issues result from some file system limit.

Original comment by bojune...@gmail.com on 22 Jun 2010 at 4:36

GoogleCodeExporter commented 9 years ago
Hi again,

--------------------------------------------------------
Can you please verify that you can generate larger files 
in the same directory concatenating two large files?
--------------------------------------------------------

Yes. It is possible.

--------------------------------------------------------
Can you also try running this on a different machine?  
--------------------------------------------------------

I tried Exp. 1 and Exp. 3 on my local machine (not the Opteron) and got the 
assertions (see #4). So there is definitely an effect when switching 
to a less powerful machine in connection with this
assertion failure. The local system under test was an 
IBM Thinkpad with a 1.5 GHz single-core CPU and 512 MB of RAM. So I 
guess the problem with the 4-grams of experiment 2 (#4) is related 
to a performance problem. Maybe I should try it with some cutoff for the 
3-gram and/or 4-gram counts. By the way: MITLM does not support such
a cutoff feature, right?

--------------------------------------------------------
To isolate the issue, try to run everything on local disk 
instead of over the network.  
--------------------------------------------------------

I did that for Exp. 1 and 2. I had the exact same problem as before. 
Creating the model seems to work fine, but there is some error 
when writing the model to disk.
The arpa file only contains the first 2147483647 bytes as well. 
On this particular local disk I successfully repeated the "cat" 
concatenation of two 2.14 GB files. No disk limitation here. The machine 
I used is described below.

--------------------------------------------------------
If this still doesn't work, can you please describe your 
system configuration?  I currently suspect most of these 
issues result from some file system limit.
--------------------------------------------------------

Here it is:

meminfo:

MemTotal:        7934804 kB
MemFree:         1518820 kB
Buffers:             748 kB
Cached:          3011228 kB
SwapCached:         2104 kB
Active:          5237372 kB
Inactive:        1008168 kB
Active(anon):    2752328 kB
Inactive(anon):   481252 kB
Active(file):    2485044 kB
Inactive(file):   526916 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      16771820 kB
SwapFree:       16706040 kB
Dirty:                16 kB
Writeback:             0 kB
AnonPages:       3232788 kB
Mapped:            16528 kB
Slab:             113032 kB
SReclaimable:      92724 kB
SUnreclaim:        20308 kB
PageTables:        13408 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    20739220 kB
Committed_AS:    4195476 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       34744 kB
VmallocChunk:   34359698355 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       10240 kB
DirectMap2M:     8116224 kB

cpuinfo (identical for all 4 cores):

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : Quad-Core AMD Opteron(tm) Processor 2360 SE
stepping        : 3
cpu MHz         : 2500.253
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt rdtscp lm 
3dnowext 3dnow constant_tsc rep_good tsc_reliable nonstop_tsc extd_apicid pni 
cx16 popcnt lahf_lm extapic abm sse4a misalignsse 3dnowprefetch
bogomips        : 5000.20
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

local disk in use:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sdb1            103210940   8668968  89299164   9% /data

other info:

Linux computer 2.6.31.12-0.2-desktop #1 SMP PREEMPT 2010-03-16 21:25:39 +0100 
x86_64 x86_64 x86_64 GNU/Linux
Linux version 2.6.31.12-0.2-desktop (geeko@buildhost) (gcc version 4.4.1 
[gcc-4_4-branch revision 150839] (SUSE Linux) ) #1 SMP PREEMPT 2010-03-16 
21:25:39 +0100
openSUSE 11.2 (x86_64)
VERSION = 11.2

--------------------------------------------------------

Thanks for your help!
Best regards.

Original comment by sebastia...@googlemail.com on 23 Jun 2010 at 12:30

GoogleCodeExporter commented 9 years ago
Running out of ideas.  A few things you can try:
- Write to .arpa.gz instead of .arpa.  MITLM will automatically compress the 
file using gzip.  My hypothesis is that we will still hit the problem at 2GB, 
if the compressed file exceeds that size.
- Verify that we are indeed encountering an OS file size issue by creating a 
simple C++ executable that writes a large file using the same technique as 
MITLM.  See src/util/ZFile.h.
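
A minimal sketch of such a test program, assuming MITLM's `ZFile` ultimately 
writes through a stdio `FILE*` (not verified here; see src/util/ZFile.h for 
the real code). `write_large_file` and any file name passed to it are 
hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Writes `total_bytes` zero bytes to `path` in 1 MiB chunks using stdio and
// returns the number of bytes fwrite accepted. A result smaller than
// `total_bytes` means a write failed, for example at a file size limit.
std::uint64_t write_large_file(const char* path, std::uint64_t total_bytes) {
    std::FILE* f = std::fopen(path, "wb");
    if (f == nullptr) return 0;
    const std::vector<char> chunk(1 << 20, 0);
    std::uint64_t written = 0;
    while (written < total_bytes) {
        const std::uint64_t remaining = total_bytes - written;
        const std::size_t n = remaining < chunk.size()
                                  ? static_cast<std::size_t>(remaining)
                                  : chunk.size();
        const std::size_t got = std::fwrite(chunk.data(), 1, n, f);
        written += got;
        if (got != n) break;  // short write: stop and report what succeeded
    }
    std::fclose(f);
    return written;
}
```

Calling it with a target of, say, 3 GiB (`3ULL << 30`) and comparing the 
return value and the resulting file size against 2147483647 would show 
whether the limit also applies outside MITLM.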

No, MITLM does not currently support count cutoffs.  It should work if you 
manually remove n-grams from the count file, although I have not tried 
it before.  You will likely have to tune the discount parameters in this case 
as the estimation from count of count statistics will not apply.
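
A rough sketch of such a manual cutoff. The line format is an assumption, 
not confirmed against MITLM: one n-gram per line, words separated by spaces, 
then a tab and an integer count. `apply_count_cutoff` is a hypothetical 
helper:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies lines from `in` to `out`, dropping n-grams of order >= `min_order`
// whose count is below `min_count`. Assumed format: "w1 ... wn<TAB>count".
// Lines without a tab or without a numeric count are passed through.
// Returns the number of dropped lines.
std::size_t apply_count_cutoff(std::istream& in, std::ostream& out,
                               int min_order, long min_count) {
    std::size_t dropped = 0;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t tab = line.rfind('\t');
        bool drop = false;
        if (tab != std::string::npos) {
            // Order = number of whitespace-separated words before the tab.
            std::istringstream words(line.substr(0, tab));
            std::string word;
            int order = 0;
            while (words >> word) ++order;
            // Parse the count; leave the line alone if it is not a number.
            const char* begin = line.c_str() + tab + 1;
            char* end = nullptr;
            const long count = std::strtol(begin, &end, 10);
            const bool parsed = (end != begin);
            drop = parsed && order >= min_order && count < min_count;
        }
        if (drop) {
            ++dropped;
        } else {
            out << line << '\n';
        }
    }
    return dropped;
}
```

For example, `apply_count_cutoff(in, out, 3, 2)` would remove 3-grams and 
4-grams seen only once while leaving unigrams and bigrams untouched.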

Paul

Original comment by bojune...@gmail.com on 23 Jun 2010 at 4:30

GoogleCodeExporter commented 9 years ago
Hi Paul,
just to inform you: I am going to further investigate this issue as soon as I 
am done with some other things. Unfortunately there were some urgent issues 
here and there - as always.
Thanks for the help so far. 

Original comment by sebastia...@googlemail.com on 5 Jul 2010 at 2:17

GoogleCodeExporter commented 9 years ago
Hi Paul,

I tested possibility 1, which enabled me to build an arpa file (3-gram, 9.7mio 
sent., 0.1mio for heldout) larger than 2147483647 bytes. The final arpa file 
(after successful unpacking) had a size of 2167945068 bytes, whereas the gz 
file had a size below the 2.14GB threshold. When I try to create a model with 
the same parameters by writing the results directly to the arpa file, I have 
the same issues as before: the arpa file is missing all information "behind" 
byte #2147483647.

Finally, I tried to create the "big" 3-gram model using 9.8 million sentences 
without heldout (this always crashed when writing directly to arpa) and writing 
the result to a gz file. The gz file had a size of 686332293 Bytes and 
2345369756 Bytes after unpacking.

So at least that solved the problem of creating big models without really 
identifying the reason for the problem. I will go into the source code. Maybe I 
can find a hint there.

Best regards.

Original comment by sebastia...@googlemail.com on 9 Jul 2010 at 11:29

GoogleCodeExporter commented 9 years ago
Sounds like you have avoided the 2GB limit issue for now since both 
.gz files are below 2GB.

I am still troubled by the 2GB issue.  Can you please verify that you are 
building mitlm for 64bit?
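
One way to check this, sketched under the assumption that the helper is 
compiled with the same compiler and flags as MITLM (`BuildInfo`, 
`describe_build`, and `print_build` are hypothetical names, not part of 
MITLM): print the widths of pointers, `size_t`, and `off_t`. The 32-bit-looking 
address in the earlier gdb output (`0xffffe425 in __kernel_vsyscall ()`) may 
hint at a 32-bit binary, but that is not confirmed in this thread.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdio>
#include <sys/types.h>  // off_t (POSIX)

// Widths, in bits, of the types that bound addressable memory and file offsets.
struct BuildInfo {
    int pointer_bits;  // 64 on an x86_64 build, 32 on an i386 build
    int size_t_bits;   // bounds the size of a single allocation
    int off_t_bits;    // bounds file offsets used by POSIX/stdio I/O
};

BuildInfo describe_build() {
    BuildInfo info;
    info.pointer_bits = static_cast<int>(sizeof(void*) * CHAR_BIT);
    info.size_t_bits = static_cast<int>(sizeof(std::size_t) * CHAR_BIT);
    info.off_t_bits = static_cast<int>(sizeof(off_t) * CHAR_BIT);
    return info;
}

void print_build(const BuildInfo& info) {
    std::printf("pointer: %d bits, size_t: %d bits, off_t: %d bits\n",
                info.pointer_bits, info.size_t_bits, info.off_t_bits);
}
```

If `off_t` turned out to be 32 bits wide, defining `_FILE_OFFSET_BITS=64` at 
compile time is the usual glibc way to get 64-bit file offsets in a 32-bit 
build. Running `file estimate-ngram` on the binary should also report whether 
it is a 32-bit or 64-bit ELF executable.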

Original comment by bojune...@gmail.com on 9 Jul 2010 at 4:40

GoogleCodeExporter commented 9 years ago
Well, I am not sure whether I missed something. I used the x86_64 machine above 
and did the following:
- Install gcc-fortran (http://code.google.com/p/mitlm/issues/detail?id=14)
- Got patch 
http://mitlm.googlecode.com/issues/attachment?aid=-4077538411916343304&name=mitlm.0.4.ubuntu9.10.patch&token=be5f533c528fedb9ab1bd9065f4666d7
- Applied patch: patch -p1 -i mitlm.0.4.ubuntu9.10.patch
- make clean
- make DEBUG=1

Best regards.

Original comment by sebastia...@googlemail.com on 11 Jul 2010 at 2:47

GoogleCodeExporter commented 9 years ago
Hi, I have a problem with smoothing large files of 3-gram counts.
The problem is that mitlm writes about 1.5GB of the file and then fails.

I use the command:
estimate-ngram -order 3 -counts allgrams -smoothing FixModKN -wl allgrams.FixModKN.lm
and I get this error:

Saving LM to train.corpus.lm...
estimate-ngram: src/NgramModel.cpp:422: void NgramModel::SaveLM(const 
std::vector<DenseVector<double>, std::allocator<DenseVector<double> > >&, const 
std::vector<DenseVector<double>, std::allocator<DenseVector<double> > >&, 
ZFile&) const: Assertion `(size_t)(ptr - lineBuffer.data()) < 
lineBuffer.size()' failed.

Before that, I tried it on 2-grams with a 4.7GB file and it worked fine. The 
3-gram file is 20GB in size.

My operating system is GNU/Linux x86_64 with 96GB RAM. 

Original comment by Roksana....@gmail.com on 19 Nov 2012 at 10:24