intel-analytics / analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
https://analytics-zoo.readthedocs.io/
Apache License 2.0
Corrupt record when reading Json file for DIEN meta_Books.json #106

Closed yizerozhuang closed 2 years ago

yizerozhuang commented 3 years ago

When reading the JSON file meta_Books.json with read_json() in friesian, some lines are parsed as corrupt records. All columns come back null for those lines, so the rows are removed by the subsequent dropna operation, which affects the final result. The ai-matrix and recdp groups convert the JSON file to CSV first; that approach does not remove any rows and works fine in Spark. The 14th record in the following figure is one of the records that gets corrupted during reading. image

One workaround is to convert to csv first (like what ai-matrix does), though the conversion may take extra time.

hkvision commented 2 years ago

Fixed. The data is now either converted to CSV first, or read with read_text and parsed row by row with a UDF.
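
For anyone hitting the same problem, here is a minimal sketch of both approaches. It is not the actual friesian change. It assumes that the lines Spark flags as corrupt are Python-style dict literals (for example single-quoted keys) rather than strict JSON, which this thread does not confirm. The field names "asin" and "title" and the helper names parse_record, records_to_csv and parse_with_spark are hypothetical, chosen only for illustration.

```python
import ast
import csv
import io
import json


def parse_record(line):
    """Parse one line into a dict, or return None if it cannot be parsed."""
    text = line.strip()
    if not text:
        return None
    try:
        obj = json.loads(text)
    except ValueError:
        try:
            # Assumption: lines that are not strict JSON are Python-style
            # dict literals (e.g. single-quoted keys and strings).
            obj = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError):
            return None
    return obj if isinstance(obj, dict) else None


def records_to_csv(lines, fields, out):
    """Write the chosen fields of every parseable line to `out` as CSV.

    Returns (rows_written, lines_skipped) so dropped rows stay visible.
    """
    writer = csv.writer(out)
    writer.writerow(fields)
    written = skipped = 0
    for line in lines:
        record = parse_record(line)
        if record is None:
            skipped += 1
            continue
        writer.writerow([record.get(f, "") for f in fields])
        written += 1
    return written, skipped


def parse_with_spark(spark, path):
    """Read `path` as plain text and parse each row with a UDF (needs pyspark)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    # Hypothetical choice: extract only the "asin" field as a string column.
    @udf(returnType=StringType())
    def asin_of(line):
        record = parse_record(line)
        value = None if record is None else record.get("asin")
        return None if value is None else str(value)

    # spark.read.text yields one row per line in a column named "value".
    return spark.read.text(path).select(asin_of("value").alias("asin"))


if __name__ == "__main__":
    sample = ['{"asin": "A1", "title": "x"}', "{'asin': 'B2', 'title': 'y'}"]
    buf = io.StringIO()
    print(records_to_csv(sample, ["asin", "title"], buf))
    print(buf.getvalue())
```

Returning the skipped count from records_to_csv makes it easy to check that no rows are lost silently, which was the original symptom here.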