Closed tymzar closed 5 months ago
Hard to say. That shouldn't happen though. Try running the program in gdb and getting a stack trace to see what's going on.
@davisking I will do that (I will post results in approx. 2h). Do you roughly remember the size of the dataset that the English or Spanish model was trained on, as well as the RAM of the machine?
Not sure. I want to say the dataset was like 40GB maybe. And the machine had 128GB of ram maybe. It was a long time ago though so take that with a grain of salt.
Okay, thank you. Is there any formula or approximation I can use to calculate the needed amount of RAM? (I want to train a model for Polish on a 50-80GB dataset.) Do you know how big total_word_feature_extractor would be?
I don't recall. It should be linear though. Try some sizes and see what happens. But I guess you are saying you think you are just running out of RAM? I would normally expect a bus error to be something else, but I'm not sure if OS X just reports the error differently.
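Taking the linear assumption at face value, here is a rough back-of-envelope sketch — not a guarantee: it assumes memory use really is linear in corpus size, and it uses the uncertain 40GB-corpus / 128GB-RAM recollection above as the only reference point:

```shell
# Linear extrapolation of RAM needed for a given corpus size.
# Reference point (hedged, from memory): ~40GB corpus on a ~128GB machine.
ref_data_gb=40
ref_ram_gb=128
target_data_gb=80   # upper end of the planned Polish corpus

est_ram_gb=$(awk -v d="$target_data_gb" -v rd="$ref_data_gb" -v rr="$ref_ram_gb" \
    'BEGIN { printf "%.0f", d / rd * rr }')
echo "estimated RAM for a ${target_data_gb}GB corpus: ~${est_ram_gb}GB"
```

By that (very rough) measure, an 80GB corpus would want on the order of 256GB of RAM, far beyond a 32GB machine.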
@davisking, on a 4GB dataset it looks to me like OpenBLAS is running out of memory. I was monitoring the RAM: swap was full, but memory usage was only about 40%, which is strange :/
I will try to increase the dataset incrementally and see when it fails. I have no idea.
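A minimal sketch of that incremental approach (paths and sizes are placeholders; a tiny generated file stands in for the real CC100-PL chunk, and the wordrep invocation is left commented out since it depends on the local build):

```shell
# Cut progressively larger prefixes of the corpus and feed each one to
# wordrep until the crash reappears.
corpus=corpus.txt
seq 1 100000 > "$corpus"   # stand-in corpus; replace with the CC100-PL chunk

for kb in 64 128 256 512; do
    mkdir -p "chunk_${kb}kb"
    head -c "$((kb * 1024))" "$corpus" > "chunk_${kb}kb/data.txt"
    echo "prepared chunk_${kb}kb/data.txt"
    # ./wordrep -e "chunk_${kb}kb"   # stop at the first size that crashes
done
```

Scaled up to GB-sized prefixes, the first size that fails should bracket the limit fairly tightly.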
lldb -- ./wordrep -e <path>
(lldb) target create "./wordrep"
Current executable set to '<path>' (arm64).
(lldb) settings set -- target.run-args "-e" "<path>"
(lldb) run
Process 46942 launched: '<path>' (arm64)
number of raw ASCII files found: 76
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 76
Sample 50000000 random context vectors
Now do CCA (left size: 50000000, right size: 50000000).
Process 46942 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6e01ccc1fc)
frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
-> 0x192905900 <+400>: ldr s0, [x27, x23, lsl #2]
0x192905904 <+404>: fcmp s0, #0.0
0x192905908 <+408>: b.ne 0x192905974 ; <+516>
0x19290590c <+412>: sub x23, x23, #0x1
Target 0: (wordrep) stopped.
@davisking I knew from some of the other issues that MITIE requires a lot of RAM, but this is quite surprising :/
number of raw ASCII files found: 1
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 1
Sample 50000000 random context vectors
Now do CCA (left size: 8582326, right size: 8582326).
correlations: 0.783697 0.495714 0.428661 0.417896 0.399711 0.308799 0.257686 0.241372 0.214332 0.206914 0.180628 0.151268 0.143147 0.135543 0.129035 0.11404 0.104493 0.094976 0.0889702 0.081165 0.0765073 0.0743562 0.0730994 0.0653017 0.0622959 0.0602029 0.0504991 0.0483258 0.0475654 0.0458159 0.0405218 0.0396676 0.0377837 0.035018 0.0321907 0.0302941 0.0294151 0.0272736 0.0250081 0.023913 0.0227566 0.0216142 0.0207634 0.0199015 0.0191722 0.0172601 0.0167749 0.0165069 0.0156701 0.0152914 0.0150729 0.0147807 0.0135261 0.013273 0.0125623 0.0120217 0.0117249 0.0115241 0.010776 0.0105152 0.0102252 0.00982769 0.00967478 0.00906732 0.00888962 0.00882777 0.00853846 0.00810019 0.00803891 0.00766455 0.0073715 0.00711813 0.00686214 0.0067737 0.00648305 0.00637957 0.00621849 0.00609243 0.00578847 0.00560462 0.00551808 0.00540755 0.00527975 0.0051427 0.00495427 0.0048308 0.00470006 0.00460154 0.00457549 0.00444141
CCA done, now build up average word vectors
num words: 200000
num word vectors loaded: 200000
got word vectors, now learn how they correlate with morphological features.
building morphological vectors
L.size(): 200000
R.size(): 200000
Now running CCA on word <-> morphology...
correlations: 0.972561 0.671965 0.612678 0.579442 0.505745 0.410469 0.370399 0.320987 0.303507 0.295264 0.284905 0.272496 0.260294 0.252493 0.247422 0.243564 0.224549 0.215319 0.211788 0.202069 0.198979 0.193592 0.187116 0.179735 0.177608 0.173987 0.167869 0.165495 0.159846 0.157329 0.152932 0.148687 0.146318 0.144891 0.1425 0.140035 0.138354 0.137298 0.13551 0.133037 0.131869 0.129603 0.129255 0.126594 0.125905 0.123318 0.119747 0.116908 0.116507 0.115395 0.114808 0.111849 0.111262 0.108799 0.108121 0.106949 0.105413 0.103576 0.103322 0.102464 0.101616 0.100574 0.100245 0.0993405 0.0986815 0.0981801 0.0973369 0.0969129 0.0962343 0.0956761 0.0950916 0.0949497 0.0937699 0.0930668 0.092798 0.0924761 0.0915229 0.0906338 0.0902752 0.0897365 0.0893135 0.0891064 0.0884818 0.0879273 0.0873452 0.0871056 0.0864901 0.0862346 0.0858451 0.0855675
morphological feature dimensionality: 90
total word feature dimensionality: 271
number of raw ASCII files found: 2
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 2
Sample 50000000 random context vectors
Now do CCA (left size: 17204027, right size: 17204027).
correlations: 0.779923 0.512237 0.436919 0.423704 0.412156 0.32337 0.260955 0.245038 0.222629 0.220487 0.186529 0.155599 0.145268 0.140225 0.1293 0.11533 0.10416 0.0969533 0.0920494 0.0812887 0.0788473 0.0760371 0.0711883 0.0664675 0.0616015 0.0582108 0.0499312 0.0481921 0.0471149 0.0457483 0.0405427 0.0391695 0.0366974 0.0343238 0.0317766 0.0301706 0.0284668 0.02708 0.024634 0.0230824 0.0221957 0.0211691 0.0201103 0.0192433 0.0177503 0.0170046 0.0165348 0.0160024 0.0154991 0.014842 0.0144806 0.0140849 0.0132977 0.0126606 0.012073 0.0118559 0.0112087 0.0108456 0.0105672 0.0102032 0.0100555 0.0096982 0.00935652 0.00908955 0.00844935 0.00818099 0.00796244 0.00762995 0.00753236 0.0072838 0.00706413 0.00690194 0.00681546 0.00654413 0.00626689 0.00608943 0.00596117 0.00557491 0.00536563 0.0051692 0.00503521 0.00487138 0.00479157 0.00464975 0.0042314 0.00396397 0.00389778 0.00373412 0.0036037 0.00347846
CCA done, now build up average word vectors
num words: 200000
num word vectors loaded: 200000
got word vectors, now learn how they correlate with morphological features.
building morphological vectors
L.size(): 200000
R.size(): 200000
Now running CCA on word <-> morphology...
correlations: 0.978974 0.713797 0.657182 0.632944 0.558261 0.47269 0.425248 0.372064 0.351736 0.339963 0.320005 0.311463 0.30237 0.294466 0.289986 0.280466 0.265051 0.25838 0.247886 0.235087 0.232196 0.225109 0.224001 0.212538 0.204304 0.202062 0.191768 0.184285 0.182187 0.181151 0.174696 0.169313 0.167562 0.16549 0.160309 0.157552 0.154192 0.152316 0.150774 0.146907 0.144155 0.142709 0.141969 0.138627 0.137163 0.134016 0.132602 0.129624 0.12818 0.126129 0.125144 0.12363 0.122081 0.120231 0.118327 0.116443 0.115346 0.114512 0.113029 0.112386 0.111879 0.111077 0.110131 0.108435 0.107854 0.107057 0.106172 0.105093 0.104389 0.103112 0.102088 0.101712 0.100469 0.0994505 0.0989811 0.0984102 0.0982125 0.0972503 0.096731 0.0956087 0.0950048 0.094337 0.0939208 0.0925872 0.0919503 0.0914311 0.0909645 0.0905782 0.0905216 0.0894758
morphological feature dimensionality: 90
total word feature dimensionality: 271
number of raw ASCII files found: 3
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 3
Sample 50000000 random context vectors
Now do CCA (left size: 25828623, right size: 25828623).
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6f1a629238)
frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
-> 0x192905900 <+400>: ldr s0, [x27, x23, lsl #2]
0x192905904 <+404>: fcmp s0, #0.0
0x192905908 <+408>: b.ne 0x192905974 ; <+516>
0x19290590c <+412>: sub x23, x23, #0x1
number of raw ASCII files found: 4
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 4
Sample 50000000 random context vectors
Now do CCA (left size: 34482189, right size: 34482189).
Process 49420 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6f1d74b3b0)
frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
-> 0x192905900 <+400>: ldr s0, [x27, x23, lsl #2]
0x192905904 <+404>: fcmp s0, #0.0
0x192905908 <+408>: b.ne 0x192905974 ; <+516>
0x19290590c <+412>: sub x23, x23, #0x1
Target 0: (wordrep) stopped.
I'm not sure what to do next, do you have any tips?
> But I guess you are saying you think you are just running out of RAM? I would normally expect a bus error to be something else, but I'm not sure if OS X just reports the error differently.
RAM was my initial guess (because I see that memory usage is capped at ~40% while swap is drained), but I could be mistaken :c
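For what it's worth, the matrix sizes in the log make an out-of-memory scenario plausible. A back-of-envelope sketch: the 50,000,000 row count comes from the "Sample 50000000 random context vectors" line, and SLARFT in the crash frame implies single-precision floats, but the per-row dimensionality is not printed, so the value below is an assumption for illustration only:

```shell
# Approximate footprint of one CCA input matrix:
# rows * dims * 4 bytes (single-precision floats).
rows=50000000   # from "Sample 50000000 random context vectors" in the log
dims=500        # ASSUMED dimensionality, for illustration only

gib=$(awk -v r="$rows" -v d="$dims" \
    'BEGIN { printf "%.1f", r * d * 4 / 1073741824 }')
echo "~${gib} GiB per matrix (and CCA needs both a left and a right side)"
```

Even a modest dimensionality puts each matrix in the tens of GiB, which on a 32GB machine means heavy swapping long before anything else goes wrong.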
@davisking do you have any ideas?
No idea. You will have to debug into it and see what the deal is.
Expected Behavior
Hi, I am trying to use wordrep on part of (and in the future maybe the full) CC100-Polish dataset (~47 GB). wordrep is freshly compiled with no errors and works properly on files of around ~40MB. But when I try to use bigger chunks, e.g. 2GB, I encounter some problems... I guess they are not hardware related, because I was monitoring memory usage via htop and there was some room to spare (Mem is for some reason limited to 20%, swp takes 45/48GB).
Current Behavior
After running wordrep on my directory, the process fails during CCA.
NOTE: the process takes about 20 min on my machine before crashing.
Steps to Reproduce
Build wordrep locally from the source code.
Run ./wordrep -e ../<path to directory>.
The directory contains a 2GB chunk of CC100-PL.
Version: 0.7.0,
Where did you get MITIE: Compiled from the GitHub repo (from a clean terminal),
Platform: Chip: Apple M2 Pro, Total Number of Cores: 10 (6 performance and 4 efficiency), Memory: 32 GB, System Version: macOS 14.0 (23A344), Kernel Version: Darwin 23.0.0
Compiler: Apple clang version 15.0.0 (clang-1500.0.40.1), Target: arm64-apple-darwin23.0.0