
Fail to import large CSV, no documentation or reason #901

Open · fulldecent opened this issue 6 months ago

fulldecent commented 6 months ago

I am using Grist as recommended, with the omnibus setup.

Importing a small CSV file was successful using this:

[screenshot, 2024-03-18 12:21:59]

However, importing a large CSV file failed.

[screenshot, 2024-03-18 12:21:53]

The file I need to load is 200MB, with 30 columns and 400,000 rows.



gabriel-v commented 6 months ago

Wanted to say the same thing. I'm running the Docker container capped at 10G ram.
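For reference, a hard cap like that can be set with Docker's own memory limit; the image name and compose layout below are assumptions, not necessarily the exact setup used here:

# plain docker run with a 10 GB hard limit (image name is an assumption)
docker run -d --name grist --memory=10g -p 8484:8484 gristlabs/grist

# or the equivalent in docker-compose:
#   services:
#     grist:
#       image: gristlabs/grist
#       mem_limit: 10g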

Here is a 2GB csv file to test with (dataset page, then direct CSV download):

https://catalog.data.gov/dataset/crimes-2001-to-present/resource/31b027d7-b633-4e82-ad2e-cfa5caaf5837

https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD
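For anyone reproducing this, the direct link above can be fetched with curl (the output filename is arbitrary):

# download the ~2 GB crimes CSV
curl -L -o crimes.csv 'https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD'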


Tried both with and without sandboxing.
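As far as I understand, the sandbox flavor is controlled by the GRIST_SANDBOX_FLAVOR environment variable; the flavor names below match what shows up in the logs, but the exact invocation is an assumption:

# gvisor sandboxing (the default in the docker image, as far as I know)
docker run -d --memory=10g -e GRIST_SANDBOX_FLAVOR=gvisor -p 8484:8484 gristlabs/grist

# no sandboxing
docker run -d --memory=10g -e GRIST_SANDBOX_FLAVOR=unsandboxed -p 8484:8484 gristlabs/grist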

With the sandbox, I get this in the logs pretty quickly:

2024-03-20 03:44:43.910 - warn: Sandbox unexpectedly exited with code 1 signal null sandboxPid=684, flavor=gvisor, command=undefined, entryPoint=(default), plugin=builtIn/core, docId=vdvs6jqzwmuzosEcHdoJoY
2024-03-20 03:44:43.927 - warn: Error during api call to /workspaces/2/import: Failed to parse CSV file.
Error: [Sandbox] PipeFromSandbox is closed:     raise Exception('gvisor runsc problem: ' + json.dumps(command))

I guess there's a low implicit memory limit for the sandbox?


With no sandbox, I get this:

[screenshot]

2024-03-20 03:50:04.240 - warn: Sandbox unexpectedly exited with code null signal SIGKILL sandboxPid=23, flavor=unsandboxed, command=undefined, entryPoint=(default), plugin=builtIn/core, docId=vdvs6jqzwmuzosEcHdoJoY

and dmesg

[26021.168950] Memory cgroup out of memory: Killed process 137254 (python3.11) total-vm:10179908kB, anon-rss:10018296kB, file-rss:6912kB, shmem-rss:0kB, UID:0 pgtables:19992kB oom_score_adj:0
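For anyone reproducing this, the container's memory usage and any OOM kills can be watched with standard tools (the container name here is an assumption):

# live memory usage per container
docker stats grist

# kernel log entries for OOM kills
dmesg -T | grep -i 'out of memory'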

So it's using more than 10GB ram to parse the 2GB CSV file. Let's give it 20GB...

[screenshot, 2024-03-20 06:02:13]

aaaand, boom

[26749.723969] Memory cgroup out of memory: Killed process 141164 (python3.11) total-vm:20545332kB, anon-rss:20259880kB, file-rss:7040kB, shmem-rss:0kB, UID:0 pgtables:40788kB oom_score_adj:0

[screenshot, 2024-03-20 06:02:50]

Hey, it guessed the headers now...


Let's give it 40GB

[screenshot, 2024-03-20 06:23:32]

No more OOM, that's nice. The container process seems to be using at most 21GB, even though the container group itself peaks at 31GB.

Still, after a couple minutes the UI breaks down:

[screenshot, 2024-03-20 06:08:55]

And there's no data in the persist folder. That's probably why the recovery view is empty:

[screenshot, 2024-03-20 06:26:36]


Ok, let's cut down on the file size. Let's truncate to the first 600k rows (180MB) and import again.
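A sketch of the truncation, assuming the original download is named crimes.csv:

# keep the header line plus the first 600,000 data rows
head -n 600001 crimes.csv > crimes-600k.csv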

[screenshot, 2024-03-20 06:34:45]

Ram usage tops out at 10G (over 50x the file size).

Seems to be fine so far, but then it shows this screen with no data; clicking the "new table" button on the left starts this spinner.

[screenshot, 2024-03-20 06:35:58]

....

after 15min

[screenshot, 2024-03-20 06:41:40]

GREAT SUCCESS! Still, clicking "ok" jumps into another spinner... I guess I'll wait 15min more.

[screenshot, 2024-03-20 06:46:25]

Victory! We have almost a million Chicago crimes now.

Here's the final ram usage for the 180MB csv file:

[screenshot, 2024-03-20 06:49:30]

Uploading the 180MB csv takes 10GB of server ram. Opening the doc then takes around 4GB of server ram, regardless of the number of open tabs for the same user.

Finally, the sqlite file is 2.1GB (roughly 12x amplification over the 180MB CSV):

➜  persist du -hd1 docs/v*
2.1G    docs/vdvs6jqzwmuzosEcHdoJoY.grist
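Since a .grist document is a SQLite database, it can also be inspected directly with the sqlite3 CLI; a quick sketch:

# list the tables stored inside the document
sqlite3 docs/vdvs6jqzwmuzosEcHdoJoY.grist '.tables'

# page_count * page_size gives the raw database size in bytes
sqlite3 docs/vdvs6jqzwmuzosEcHdoJoY.grist 'PRAGMA page_count; PRAGMA page_size;'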

After the upload finished, I restarted the container with sandboxing enabled, and reading & searching the document works (while the server container still takes 2-4GB of ram). If the sandbox has a low ram limit, I guess these 2-4GB of memory are used by the nodejs server part?


I can see two things from this: importing a CSV needs on the order of 50x its size in RAM, and the server keeps the whole document in memory even after the import is done.

Both of these things can be fixed, but it seems this was designed to fit everything into ram...


Related issues and comments

"assuming no single document becomes too large"

Yes, but the crimes data is growing at about 4MB of CSV per day.


@fulldecent: turn off sandboxing and give your container at least 16gb of ram, and it might work like it did above. Then turn sandboxing back on again.

Questions for the devs: