halaxa / json-machine

Efficient, easy-to-use, and fast PHP JSON stream parser

Parsing a Google Poly dump JSON #67

Closed · yosun closed this issue 2 years ago

yosun commented 2 years ago

So I'm trying to parse a very large JSON file that is effectively a one-dimensional series of elements (though each element has nested parameters belonging to it).

https://drive.google.com/drive/folders/1PA9_Hq1Te7aBUoPurVdICKuCuY5Peka8?usp=sharing

Memory issues seem to happen intermittently. I'd really love to be able to just foreach over this, but json-machine seems to die on this file even after setting PHP's memory limit higher.

Adventures so far: https://gist.github.com/yosun/d1ef6ef56943bd2417b07f4970ff7447
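
For context on why streaming matters here: the non-streaming baseline reads the whole file and decodes it in one shot, so peak memory is a multiple of the file size. A minimal sketch of that baseline (same file name as above, assumed):

// Naive approach: file_get_contents() pulls the entire raw JSON into memory,
// and json_decode() then builds the full decoded structure alongside it. On a
// dump this large, that blows past any reasonable memory_limit.
$data = json_decode(
    file_get_contents(__DIR__ . '/metadata_unique_all.json'),
    true // decode to associative arrays
);

A streaming parser instead yields one decoded item at a time, so peak memory is bounded by the largest single item rather than the whole document.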

halaxa commented 2 years ago

Hi, here's what I've just tried with your file:

use JsonMachine\JsonMachine;

// Stream the top-level items one by one, printing current:peak memory
// usage every 1000 items.
$items = JsonMachine::fromFile(__DIR__ . '/metadata_unique_all.json');
foreach ($items as $i => $item) {
    if ($i % 1000 === 0) {
        fwrite(STDOUT, memory_get_usage(true) . ':' . memory_get_peak_usage(true) . PHP_EOL);
    }
}

It gets to the end without any problem, and the reported memory consumption is nearly constant. The output looks like this:

...
2097152:2097152
2097152:2097152
2097152:2097152
4194304:4194304
4194304:4194304
4194304:4194304
...

Peak usage at the end jumps to just over 4 MB in my case. (memory_get_usage(true) reports memory the PHP allocator has reserved from the system, which it grabs in 2 MiB chunks; that's why the output steps from 2097152 to 4194304 rather than growing smoothly.) Can you try this code?

yosun commented 2 years ago

I think the problem might also be that I'm scraping all the resource links as I iterate, so the lower the JSON reader's memory footprint, the better.
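
If the links are being collected into one big array during iteration, that array, not the parser, is what dominates memory. A minimal sketch of writing each link out immediately instead of accumulating it; the 'formats' and 'url' keys are assumptions about the dump's structure (not its confirmed field names), and items are assumed to decode to associative arrays:

use JsonMachine\JsonMachine;

$items = JsonMachine::fromFile(__DIR__ . '/metadata_unique_all.json');
$out = fopen(__DIR__ . '/links.txt', 'w');
foreach ($items as $item) {
    // Hypothetical keys; replace with the real paths to the resource links.
    foreach ($item['formats'] ?? [] as $format) {
        if (isset($format['url'])) {
            // Flush each link to disk immediately so nothing accumulates
            // in PHP memory across iterations.
            fwrite($out, $format['url'] . PHP_EOL);
        }
    }
}
fclose($out);

This keeps peak memory at roughly the bare-loop level from the earlier test, regardless of how many links the dump contains.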

halaxa commented 2 years ago

Yes, that might be it. User code is often responsible for memory leaks. Thanks anyway, though; your report prompted me to do some checking, and I changed Parser to release some memory earlier to guard against this even more.
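
One way to tell whether remaining growth comes from user code or from the parser is to run the same loop twice, with the per-item processing disabled and then enabled, and compare peak memory across the two runs. A rough sketch; scrapeLinks() is a placeholder for whatever per-item work the user code actually does:

use JsonMachine\JsonMachine;

// Placeholder for the user's per-item work (e.g. collecting resource links).
function scrapeLinks($item): void { /* ... */ }

// Toggle this between runs. Peak memory is process-wide and monotonic, so
// the two measurements must come from separate script invocations.
const WITH_CALLBACK = false;

$items = JsonMachine::fromFile(__DIR__ . '/metadata_unique_all.json');
foreach ($items as $item) {
    if (WITH_CALLBACK) {
        scrapeLinks($item);
    }
}
fwrite(STDOUT, memory_get_peak_usage(true) . PHP_EOL);

If the peak only grows in the callback-enabled run, the leak is in the callback, not in the parser.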