Closed wanng-ide closed 1 year ago
Here are some tutorials: https://blog.csdn.net/Threeyearsago/article/details/104763329 https://www.cnblogs.com/sk-lqbzblogs/p/15979192.html https://pypi.org/project/ijson/
Thanks for your reply.
I tried a lot of methods, but they do not work.
The content of that JSON file is only a single line. For example: '{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}, ..........}' I guess this JSON file has 160M keys in a single file, so there will be a problem using the methods you mentioned.
The same problem! I do not know how to parse it.
we will release a new json file that can be read line by line.
Can you use these methods to parse your JSON file?
We have uploaded a new JSON file for captions: filtered_web_and_generated_captions_with_indent4.json. This file can be read line by line.
It still cannot be read line by line. I suggest you provide a script for parsing your JSON.
I have tried the following code, and it works. Please have a try:

```python
# JSON file format:
# line 1    {
# line 2        id1: {'g': xxx, 'w': xxx},
# line 3        id2: {'g': xxx, 'w': xxx},
# ...
# line N    }
path1 = 'path to file '
with open(path1, 'r', encoding='utf-8') as f:
    try:
        for line_data in f:
            if ':' in line_data:
                print(line_data)
                # other operations
                # ...
    except Exception as e:
        print(e)
```
I updated the README.
Thanks for your reply. The code still does not work for me. I think the problem is that the big JSON contains only one line, so with 'for line in f' that single line is the entire file: it all gets loaded into memory and the process gets stuck.
filtered_web_and_generated_captions_with_indent4.json has 160M lines. Have you downloaded the new json file? I have successfully printed each line.
I downloaded the JSON file. I cannot print every line.
Is there any error information?
No. And the MD5 of this new JSON file is 'ebd1c12c4abb97805885ac86b4d49a0f'.
And I used the code below to count the lines of the big JSON; the output is 1:

```python
num = 0
with open(input_file, 'r', encoding='utf-8') as f_reader:
    for line in f_reader:
        num += 1
print(num)
```
Yes, you are right, the JSON file has only one line. It seems that it was automatically changed from multiple lines to just one line. My server has a huge amount of RAM, so I didn't notice it.
I think you can save the label information to a big text file, where every line is a JSON string that contains an image and its label.
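The suggested layout is essentially JSON Lines: one self-contained JSON object per line. A minimal sketch with made-up records and a placeholder file name (`captions.jsonl`):

```python
import json

# Made-up example records (image id -> label); the real caption
# data would go here instead.
records = {"img_001": {"caption": "a dog"}, "img_002": {"caption": "a cat"}}

# Write one JSON object per line.
with open('captions.jsonl', 'w', encoding='utf-8') as f:
    for image_id, label in records.items():
        f.write(json.dumps({"id": image_id, "label": label}) + '\n')

# Reading back is a plain loop: one json.loads per line, constant memory.
with open('captions.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        obj = json.loads(line)
        print(obj["id"])
```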
A new TXT file is released. This time the file has a much smaller size and can be read line by line.
filtered_web_generated_captions.json is toooo large, 30GB.
My code is

```python
df = pd.read_json('filtered_web_generated_captions.json', orient='index')
```

Then,

```
Segmentation fault (core dumped)
```
Can you provide a Parquet or Arrow version, or a Hugging Face datasets version of this?