mit-nlp / MITIE

MITIE: library and tools for information extraction

Unexpected `Bus error: 10` while using `wordrep` #217

Closed tymzar closed 5 months ago

tymzar commented 11 months ago

Expected Behavior

Hi, I am trying to use wordrep on part of (and eventually, perhaps, the full) CC100-Polish dataset (~47 GB). wordrep is freshly compiled with no errors and works properly on files of around ~40 MB. But when I try to use bigger chunks, e.g. 2 GB, I encounter some problems... I guess they are not hardware related, because I was monitoring memory usage via htop and there was some room to spare (Mem is for some reason limited to 20%; swp takes 45/48 GB).

[htop screenshot: memory and swap usage]

Current Behavior

After running wordrep on my directory, the process failed during CCA.

number of raw ASCII files found: 1
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 1
Sample 50000000 random context vectors
Now do CCA (left size: 50000000, right size: 50000000).
Bus error: 10

NOTE: the process runs for about 20 minutes on my machine before crashing
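For a rough sense of scale, the CCA step holds two 50,000,000-row context matrices in memory. A back-of-envelope sketch (assuming single-precision floats, and a hypothetical per-vector dimensionality that the logs do not report):

```python
# Back-of-envelope memory estimate for the CCA step.
# wordrep reports "left size: 50000000, right size: 50000000".
DIM = 500  # HYPOTHETICAL context-vector dimensionality; not taken from the logs

def cca_matrix_bytes(rows: int, dim: int, bytes_per_float: int = 4) -> int:
    """Bytes needed for one rows x dim single-precision matrix."""
    return rows * dim * bytes_per_float

left = cca_matrix_bytes(50_000_000, DIM)
total_gib = 2 * left / 2**30  # left + right matrices together
print(f"~{total_gib:.0f} GiB for the two context matrices alone")
```

Under these assumptions the two matrices alone would dwarf the RAM of most desktop machines, which is consistent with swap filling up.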

Steps to Reproduce

davisking commented 11 months ago

Hard to say. That shouldn't happen though. Try running the program in gdb and getting a stack trace to see what's going on.

tymzar commented 11 months ago

@davisking I will do that (I will post the results in approx. 2 h). Do you roughly remember the size of the dataset that the English or Spanish model was trained on, as well as the RAM of the machine?

davisking commented 11 months ago

Not sure. I want to say the dataset was like 40GB maybe. And the machine had 128GB of ram maybe. It was a long time ago though so take that with a grain of salt.

tymzar commented 11 months ago

Okay, thank you. Is there any formula or approximation I can use to calculate the needed amount of RAM? (I want to train a model for Polish on a 50-80 GB dataset.) Do you know how big total_word_feature_extractor would be?

davisking commented 11 months ago

I don't recall. It should be linear though. Try some sizes and see what happens. But I guess you are saying you think you are just running out of RAM? I would normally expect a bus error to be something else, but I'm not sure if OS X just reports the error differently.
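If the scaling really is linear, two small trial runs are enough to extrapolate. A minimal sketch (the trial measurements below are hypothetical placeholders; substitute peak-RAM figures observed in htop for your own runs):

```python
def extrapolate_ram_gb(trials, target_gb):
    """Fit ram = a * dataset_gb + b through two (dataset_gb, ram_gb) trial points
    and evaluate the line at target_gb."""
    (x1, y1), (x2, y2) = trials
    a = (y2 - y1) / (x2 - x1)
    b = y1 - a * x1
    return a * target_gb + b

# HYPOTHETICAL trials: a 0.05 GB run peaking at 6 GB RAM, a 0.1 GB run at 8 GB
estimate = extrapolate_ram_gb([(0.05, 6.0), (0.1, 8.0)], 50)
print(f"estimated peak RAM for a 50 GB dataset: {estimate:.0f} GB")
```

The intercept `b` captures fixed overhead (e.g. the 50M sampled context vectors), so the fit is only meaningful once both trials are past the point where that sampling saturates.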

tymzar commented 11 months ago

@davisking, on a 4 GB dataset, it looks to me like OpenBLAS is running out of memory. I was monitoring the RAM: swap was full, but memory usage was only about 40%, which is strange :/

I will try to increase the dataset incrementally and see when it fails. I have no idea.

lldb -- ./wordrep -e <path>
(lldb) target create "./wordrep"
Current executable set to '<path>' (arm64).
(lldb) settings set -- target.run-args  "-e" "<path>"
(lldb) run
Process 46942 launched: '<path>' (arm64)
number of raw ASCII files found: 76
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 76
Sample 50000000 random context vectors
Now do CCA (left size: 50000000, right size: 50000000).
Process 46942 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6e01ccc1fc)
    frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
->  0x192905900 <+400>: ldr    s0, [x27, x23, lsl  #2]
    0x192905904 <+404>: fcmp   s0, #0.0
    0x192905908 <+408>: b.ne   0x192905974               ; <+516>
    0x19290590c <+412>: sub    x23, x23, #0x1
Target 0: (wordrep) stopped.
tymzar commented 11 months ago

@davisking I knew from some of the issues that MITIE requires a lot of RAM, but that's quite surprising :/

dataset -> total_word_feature_extractor.dat

1. 50MB -> 335MB

number of raw ASCII files found: 1
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 1
Sample 50000000 random context vectors
Now do CCA (left size: 8582326, right size: 8582326).
correlations:   0.783697   0.495714   0.428661   0.417896   0.399711   0.308799   0.257686   0.241372   0.214332   0.206914   0.180628   0.151268   0.143147   0.135543   0.129035    0.11404   0.104493   0.094976  0.0889702   0.081165  0.0765073  0.0743562  0.0730994  0.0653017  0.0622959  0.0602029  0.0504991  0.0483258  0.0475654  0.0458159  0.0405218  0.0396676  0.0377837   0.035018  0.0321907  0.0302941  0.0294151  0.0272736  0.0250081   0.023913  0.0227566  0.0216142  0.0207634  0.0199015  0.0191722  0.0172601  0.0167749  0.0165069  0.0156701  0.0152914  0.0150729  0.0147807  0.0135261   0.013273  0.0125623  0.0120217  0.0117249  0.0115241   0.010776  0.0105152  0.0102252 0.00982769 0.00967478 0.00906732 0.00888962 0.00882777 0.00853846 0.00810019 0.00803891 0.00766455  0.0073715 0.00711813 0.00686214  0.0067737 0.00648305 0.00637957 0.00621849 0.00609243 0.00578847 0.00560462 0.00551808 0.00540755 0.00527975  0.0051427 0.00495427  0.0048308 0.00470006 0.00460154 0.00457549 0.00444141 
CCA done, now build up average word vectors
num words: 200000
num word vectors loaded: 200000
got word vectors, now learn how they correlate with morphological features.
building morphological vectors
L.size(): 200000
R.size(): 200000
Now running CCA on word <-> morphology...
correlations:  0.972561  0.671965  0.612678  0.579442  0.505745  0.410469  0.370399  0.320987  0.303507  0.295264  0.284905  0.272496  0.260294  0.252493  0.247422  0.243564  0.224549  0.215319  0.211788  0.202069  0.198979  0.193592  0.187116  0.179735  0.177608  0.173987  0.167869  0.165495  0.159846  0.157329  0.152932  0.148687  0.146318  0.144891    0.1425  0.140035  0.138354  0.137298   0.13551  0.133037  0.131869  0.129603  0.129255  0.126594  0.125905  0.123318  0.119747  0.116908  0.116507  0.115395  0.114808  0.111849  0.111262  0.108799  0.108121  0.106949  0.105413  0.103576  0.103322  0.102464  0.101616  0.100574  0.100245 0.0993405 0.0986815 0.0981801 0.0973369 0.0969129 0.0962343 0.0956761 0.0950916 0.0949497 0.0937699 0.0930668  0.092798 0.0924761 0.0915229 0.0906338 0.0902752 0.0897365 0.0893135 0.0891064 0.0884818 0.0879273 0.0873452 0.0871056 0.0864901 0.0862346 0.0858451 0.0855675 

morphological feature dimensionality: 90
total word feature dimensionality: 271

2. 100MB -> 337MB

number of raw ASCII files found: 2
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 2
Sample 50000000 random context vectors
Now do CCA (left size: 17204027, right size: 17204027).
correlations: 0.779923 0.512237 0.436919 0.423704 0.412156 0.32337 0.260955 0.245038 0.222629 0.220487 0.186529 0.155599 0.145268 0.140225 0.1293 0.11533 0.10416 0.0969533 0.0920494 0.0812887 0.0788473 0.0760371 0.0711883 0.0664675 0.0616015 0.0582108 0.0499312 0.0481921 0.0471149 0.0457483 0.0405427 0.0391695 0.0366974 0.0343238 0.0317766 0.0301706 0.0284668 0.02708 0.024634 0.0230824 0.0221957 0.0211691 0.0201103 0.0192433 0.0177503 0.0170046 0.0165348 0.0160024 0.0154991 0.014842 0.0144806 0.0140849 0.0132977 0.0126606 0.012073 0.0118559 0.0112087 0.0108456 0.0105672 0.0102032 0.0100555 0.0096982 0.00935652 0.00908955 0.00844935 0.00818099 0.00796244 0.00762995 0.00753236 0.0072838 0.00706413 0.00690194 0.00681546 0.00654413 0.00626689 0.00608943 0.00596117 0.00557491 0.00536563 0.0051692 0.00503521 0.00487138 0.00479157 0.00464975 0.0042314 0.00396397 0.00389778 0.00373412 0.0036037 0.00347846
CCA done, now build up average word vectors
num words: 200000
num word vectors loaded: 200000
got word vectors, now learn how they correlate with morphological features.
building morphological vectors
L.size(): 200000
R.size(): 200000
Now running CCA on word <-> morphology...
correlations: 0.978974 0.713797 0.657182 0.632944 0.558261 0.47269 0.425248 0.372064 0.351736 0.339963 0.320005 0.311463 0.30237 0.294466 0.289986 0.280466 0.265051 0.25838 0.247886 0.235087 0.232196 0.225109 0.224001 0.212538 0.204304 0.202062 0.191768 0.184285 0.182187 0.181151 0.174696 0.169313 0.167562 0.16549 0.160309 0.157552 0.154192 0.152316 0.150774 0.146907 0.144155 0.142709 0.141969 0.138627 0.137163 0.134016 0.132602 0.129624 0.12818 0.126129 0.125144 0.12363 0.122081 0.120231 0.118327 0.116443 0.115346 0.114512 0.113029 0.112386 0.111879 0.111077 0.110131 0.108435 0.107854 0.107057 0.106172 0.105093 0.104389 0.103112 0.102088 0.101712 0.100469 0.0994505 0.0989811 0.0984102 0.0982125 0.0972503 0.096731 0.0956087 0.0950048 0.094337 0.0939208 0.0925872 0.0919503 0.0914311 0.0909645 0.0905782 0.0905216 0.0894758

morphological feature dimensionality: 90
total word feature dimensionality: 271

3. 150MB -> failed

number of raw ASCII files found: 3
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 3
Sample 50000000 random context vectors
Now do CCA (left size: 25828623, right size: 25828623).
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6f1a629238)
    frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
->  0x192905900 <+400>: ldr    s0, [x27, x23, lsl  #2]
    0x192905904 <+404>: fcmp   s0, #0.0
    0x192905908 <+408>: b.ne   0x192905974               ; <+516>
    0x19290590c <+412>: sub    x23, x23, #0x1

4. 200MB -> failed

number of raw ASCII files found: 4
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 4
Sample 50000000 random context vectors
Now do CCA (left size: 34482189, right size: 34482189).
Process 49420 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6f1d74b3b0)
    frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
->  0x192905900 <+400>: ldr    s0, [x27, x23, lsl  #2]
    0x192905904 <+404>: fcmp   s0, #0.0
    0x192905908 <+408>: b.ne   0x192905974               ; <+516>
    0x19290590c <+412>: sub    x23, x23, #0x1
Target 0: (wordrep) stopped. 

I'm not sure what to do next. Do you have any tips?
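As a quick sanity check, the CCA "left size" reported in the runs above does grow almost exactly linearly with the dataset size, so the crash at 150 MB is not caused by any obvious jump in problem size. A sketch using the sizes copied from the logs:

```python
# CCA "left size" values reported by wordrep for each dataset size in this thread.
runs = {50: 8_582_326, 100: 17_204_027, 150: 25_828_623, 200: 34_482_189}

for mb, left in runs.items():
    # vectors-per-MB should be roughly constant if the scaling is linear
    print(f"{mb:3d} MB -> left size {left:>10,} ({left / mb:,.0f} vectors per MB)")
```

The per-MB ratio stays within about half a percent across all four runs, which matches the "it should be linear" expectation and suggests the failure threshold sits between the 100 MB and 150 MB left sizes.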

tymzar commented 11 months ago

> But I guess you are saying you think you are just running out of RAM? I would normally expect a bus error to be something else, but I'm not sure if OS X just reports the error differently.

RAM was my initial guess (because I see that memory usage is capped at 40% and swap is drained), but I could be mistaken :c

tymzar commented 11 months ago

@davisking do you have any ideas?

davisking commented 11 months ago

No idea. You will have to debug into it and see what the deal is.