ksOAn6g5 / TaiSu

TaiSu (太素): a large-scale Chinese multimodal dataset (a hundred-million-scale Chinese vision-language pre-training dataset)

How to load the large json file? #5

Closed wanng-ide closed 1 year ago

wanng-ide commented 1 year ago

filtered_web_generated_captions.json is too large: 30 GB.

My code is

df = pd.read_json('filtered_web_generated_captions.json', orient='index')

Then,

Segmentation fault (core dumped)

Could you provide a Parquet or Arrow version, or a Hugging Face dataset version of this?

YulongBonjour commented 1 year ago

Here are some tutorials: https://blog.csdn.net/Threeyearsago/article/details/104763329 https://www.cnblogs.com/sk-lqbzblogs/p/15979192.html https://pypi.org/project/ijson/

wanng-ide commented 1 year ago

Thanks for your reply.

I tried a lot of methods, but none of them work.

That JSON file contains only a single line, for example: '{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}, ..........}' I guess this JSON file holds 160M indexes in that single line, so the methods you mentioned will run into problems.
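For a file shaped like the example above (one top-level object on a single line), the entries can in principle be read incrementally with only the standard library. The following is a minimal sketch, not the dataset authors' method: the function name `iter_top_level_items` and the `chunk_size` parameter are hypothetical, and it assumes the file is one well-formed JSON object encoded as UTF-8.

```python
import json

WS = " \t\r\n"      # JSON whitespace
SKIP = WS + ","     # whitespace plus the comma between entries


def iter_top_level_items(path, chunk_size=1 << 20):
    """Yield (key, value) pairs of one big JSON object, reading the file
    chunk_size characters at a time instead of loading it whole."""
    decoder = json.JSONDecoder()
    with open(path, "r", encoding="utf-8") as f:
        buf = f.read(chunk_size)
        pos = buf.index("{") + 1  # step past the opening brace
        while True:
            try:
                while buf[pos] in SKIP:
                    pos += 1
                if buf[pos] == "}":
                    return  # closing brace of the top-level object
                key, i = decoder.raw_decode(buf, pos)
                i = buf.index(":", i) + 1
                while buf[i] in WS:  # raw_decode does not skip whitespace
                    i += 1
                value, end = decoder.raw_decode(buf, i)
                if end == len(buf):
                    # The value may be cut off at the chunk boundary.
                    raise ValueError("need more data")
            except (ValueError, IndexError):
                # The entry is incomplete in the buffer: keep the unparsed
                # tail, append the next chunk, and retry.
                more = f.read(chunk_size)
                if not more:
                    raise ValueError("truncated or malformed JSON") from None
                buf = buf[pos:] + more
                pos = 0
                continue
            yield key, value
            pos = end
```

Usage would look like `for key, value in iter_top_level_items('filtered_web_generated_captions.json'): ...`. Memory stays bounded by the chunk size plus the largest single entry, as long as the file is well formed; on malformed input the sketch keeps buffering until the end of the file before raising. The ijson package linked earlier in the thread is a library built for this kind of streaming parse.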

fanOfJava commented 1 year ago

Same problem here! I don't know how to parse it.

ksOAn6g5 commented 1 year ago

We will release a new JSON file that can be read line by line.

Here are some tutorials: https://blog.csdn.net/Threeyearsago/article/details/104763329 https://www.cnblogs.com/sk-lqbzblogs/p/15979192.html https://pypi.org/project/ijson/

fanOfJava commented 1 year ago

Can you parse your JSON file with these methods?

ksOAn6g5 commented 1 year ago

We uploaded a new JSON file for the captions: filtered_web_and_generated_captions_with_indent4.json. This file can be read line by line.

fanOfJava commented 1 year ago

It still cannot be read line by line. I suggest you provide a script for parsing your JSON file.

ksOAn6g5 commented 1 year ago

I tried the following code and it works. Please give it a try:

    '''
    JSON file format:
    line 1   {
    line 2       id1: {'g': xxx, 'w': xxx},
    line 3       id2: {'g': xxx, 'w': xxx},
    ...
    line N   }
    '''
    import json

    path1 = 'path to file'
    with open(path1, 'r', encoding='utf-8') as f:
        for line_data in f:
            # Skip the lone "{" and "}" lines; entry lines contain ':'.
            if ':' not in line_data:
                continue
            print(line_data)
            # Assuming the format above, each entry line is a
            # '"id": {...},' fragment: drop the trailing comma and wrap
            # it in braces so that it parses as a JSON object.
            data = json.loads('{' + line_data.strip().rstrip(',') + '}')
            # other operations
            # ...

ksOAn6g5 commented 1 year ago

I updated the README.

fanOfJava commented 1 year ago

Thanks for your reply. The code still does not work for me. I think the problem is that the big JSON file contains only one line, so 'for line in f' loads the whole file into memory and gets stuck.
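Iterating over a text file reads up to the next newline, so a file without newlines comes back as one huge string. One cheap way to check for that before looping is to read a bounded prefix of the first line. This is only a sketch; the helper name `first_line_is_longer_than` and the `limit` parameter are hypothetical.

```python
def first_line_is_longer_than(path, limit=1_000_000):
    """Return True if the first line has more than `limit` characters,
    reading at most limit + 1 characters into memory."""
    with open(path, "r", encoding="utf-8") as f:
        # readline(size) stops at a newline or after `size` characters,
        # whichever comes first.
        head = f.readline(limit + 1)
    return len(head) > limit and not head.endswith("\n")
```

If this returns True for a modest limit, the file is effectively a single line and line-by-line reading will not help.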

ksOAn6g5 commented 1 year ago

filtered_web_and_generated_captions_with_indent4.json has 160M lines. Have you downloaded the new json file? I have successfully printed each line.

fanOfJava commented 1 year ago

I downloaded the JSON file. I cannot print every line.

ksOAn6g5 commented 1 year ago

Is there any error message?

fanOfJava commented 1 year ago

No. And the MD5 of this new JSON file is 'ebd1c12c4abb97805885ac86b4d49a0f'.
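For anyone comparing checksums on a file this large, the MD5 can be computed in fixed-size chunks so that the file never has to fit in memory. A minimal sketch using the standard `hashlib` module; the helper name `file_md5` and its `chunk_size` parameter are hypothetical.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result can then be compared with the value quoted above to confirm that both sides have the same file.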

fanOfJava commented 1 year ago

And I used the code below to count the lines of the big JSON file; the output is 1:

    num = 0
    with open(input_file, 'r', encoding='utf-8') as f_reader:
        for line in f_reader:
            num += 1
    print(num)

ksOAn6g5 commented 1 year ago

Yes, you are right, the JSON file has only one line. It seems that it was automatically changed from multiple lines to just one line, and my server has a huge amount of RAM, so I didn't notice.

fanOfJava commented 1 year ago

I think you could save the label information to a big text file, where every line is a JSON string that contains an image and its label.
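The layout suggested here (one JSON string per line) can be sketched as follows. The field names `image` and `label` and both function names are hypothetical, chosen only for illustration; the released file may use a different layout.

```python
import json


def write_json_lines(records, path):
    """Write (image_id, label) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for image_id, label in records:
            row = {"image": image_id, "label": label}
            # json.dumps escapes newlines inside strings, so each record
            # stays on a single physical line.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Yield one parsed record at a time, never the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

With this layout, a plain `for line in f` loop holds only one record in memory at a time, which is what the single-line JSON file could not offer.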

ksOAn6g5 commented 1 year ago

A new TXT file is released. This time the file has a much smaller size and can be read line by line.