halaxa / json-machine

Efficient, easy-to-use, and fast PHP JSON stream parser
Apache License 2.0
1.08k stars 65 forks

File parsing takes too long. #96

Closed. KKKKK-tech closed this issue 1 year ago

KKKKK-tech commented 1 year ago

Hi, this may be a stupid question, but I need to solve this problem urgently, so I opened this issue. I have a very large JSON file, about 5 GB with 30 million lines. I tried parsing it with JSON Machine, but it took so long that I got an Internal Server Error in the browser. I noticed in the README that 100 GB files can also be parsed, but I'm not sure how to write the code since I'm not very good at PHP. My JSON file format is roughly as follows:

{
    "head": {...},
    "PTHT0000001": {"CDD": [...], "SMART": [...]},
    ...,
    "PTHT0012803": {"CDD": [...], "SMART": [...]}
}

My goal is to find a unique PTHTxxxxxxx key and extract its value. How should I parse it? Thank you very much!

pkoppstein commented 1 year ago

The jm script (https://github.com/pkoppstein/jm) is based on JSON Machine, and could be used as follows to find the value of a specific key:

$ jm -s | grep --max-count 1 '^{"PTHT0000001"'

Even if you don't want to use jm itself, you could examine it to see how it accomplishes what you want.

Alternatively, you might consider using the --stream option of jq (https://github.com/stedolan/jq), which is designed for just this kind of problem:

$ jq -n --stream 'first(fromstream( (inputs | select(.[0][0] == "PTHT0000001")), [["PTHT0000001"]]  ))'

halaxa commented 1 year ago

If you need to use it from inside PHP, just use a simple foreach and look for your key there.

foreach (Items::fromFile('500gb.json') as $key => $item) {
    if ($key === "PTHT0012803") {
        // your code
    }
}

Keep in mind that a file of this size might take hours to parse with JSON Machine. I would guess 2-4 hours, depending on the machine and PHP configuration. You might also be interested in #97.
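Since the goal is to find one unique key, the loop can stop as soon as that key is seen instead of reading the rest of the file. The sketch below is only an illustration of that idea: `findValueByKey` and `sampleItems` are hypothetical names, and the generator stands in for the iterator returned by `Items::fromFile()` so the example runs without the library. How much time this saves depends on where the key sits in the file.

```php
<?php
declare(strict_types=1);

/**
 * Return the value stored under $wantedKey, or null if the key never appears.
 * Returning from inside the loop stops the iteration early, so a lazy
 * iterator does not have to produce the remaining items.
 * (Hypothetical helper, not part of JSON Machine.)
 */
function findValueByKey(iterable $items, string $wantedKey)
{
    foreach ($items as $key => $value) {
        if ($key === $wantedKey) {
            return $value;
        }
    }
    return null;
}

/**
 * Stand-in for the key => item pairs a streaming parser would yield.
 * The data is made up and only mirrors the shape described in the issue.
 */
function sampleItems(): Generator
{
    yield 'head' => ['version' => 1];
    yield 'PTHT0000001' => ['CDD' => [1], 'SMART' => [2]];
    yield 'PTHT0012803' => ['CDD' => [3], 'SMART' => [4]];
}

$found = findValueByKey(sampleItems(), 'PTHT0000001');
echo json_encode($found), PHP_EOL; // prints {"CDD":[1],"SMART":[2]}
```

With JSON Machine installed through Composer, the same helper could presumably be called as `findValueByKey(\JsonMachine\Items::fromFile('data.json'), 'PTHT0012803')`, where `data.json` is a placeholder file name.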

halaxa commented 1 year ago

Sorry, I read 500 GB instead of just 5 GB. Then it should be a matter of minutes. Make sure Xdebug is disabled and JIT is enabled.

Also increase your PHP time limit if you parse from the browser.
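A quick way to check these three settings from PHP itself might look like the sketch below. The function name `parsingEnvironmentReport` is hypothetical; it assumes PHP 8, where `opcache_get_status()` reports JIT state under a `jit` key, and it treats JIT as off whenever OPcache is unavailable.

```php
<?php
declare(strict_types=1);

/**
 * Collect the settings mentioned above into one array.
 * (Hypothetical helper, shown only as a sketch.)
 */
function parsingEnvironmentReport(): array
{
    $jitOn = false;
    // opcache_get_status() exists only when the OPcache extension is loaded
    // and returns false when OPcache is disabled for the current SAPI.
    if (function_exists('opcache_get_status')) {
        $status = @opcache_get_status(false);
        if (is_array($status)) {
            $jitOn = (bool) ($status['jit']['on'] ?? false);
        }
    }

    return [
        'xdebug_loaded'      => extension_loaded('xdebug'),
        'jit_on'             => $jitOn,
        // 0 means "no limit"; the CLI defaults to 0.
        'max_execution_time' => (int) ini_get('max_execution_time'),
    ];
}

// Lift the time limit for this request before starting a long parse.
set_time_limit(0);

print_r(parsingEnvironmentReport());
```

Running a long parse from the command line rather than through the browser also sidesteps the web server's own timeouts, which `set_time_limit()` does not control.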

KKKKK-tech commented 1 year ago

Thank you for your help! I will try the method you mentioned later.

KKKKK-tech commented 1 year ago

Thank you for your reply. I used code like this before, but it took too long. I think it may be because the foreach loop takes too much time. Is there any way to avoid this in JSON Machine? If not, I will try to split the large file into smaller ones.

halaxa commented 1 year ago

If you split the file, parsing will take about the same time in total anyway. There is no faster solution in JSON Machine for now. Keep an eye on #97, which should bring some speedup.