Groot for Large Data Sets

piercefreeman commented 8 years ago

I'm using Groot with a fairly large data set (~100,000 objects with relationships). Some entities have only ~100 objects, but the larger ones have around 30,000. Right now parsing takes a long time, and the time seems to be spent in -[NSManagedObject(Groot) grt_setRelationship:fromJSONDictionary:mergeChanges:error:], specifically in the executeFetchRequest call made by the existingObjectsWithJSONArray method. Does anyone have suggestions for speeding up this specific step, perhaps at the Core Data level?

aspcartman commented 8 years ago

For large datasets it's always recommended to do things the hard way: by hand. The universality of tools comes at a price. You should also consider using "background" contexts.
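A minimal sketch of the background-context suggestion, assuming the app uses an NSPersistentContainer. The helper name importInBackground and the importWork closure are hypothetical; the closure stands in for whatever does the actual parsing (a Groot call or a hand-written import). This only moves the work off the main thread, it does not by itself reduce the number of fetches.

```swift
import CoreData

/// Hypothetical helper: runs `importWork` on a private-queue context
/// created by the container, saves if anything changed, then reports
/// the outcome through `completion` (called on the background queue).
func importInBackground(
    container: NSPersistentContainer,
    importWork: @escaping (NSManagedObjectContext) throws -> Void,
    completion: @escaping (Error?) -> Void
) {
    container.performBackgroundTask { context in
        do {
            // Parsing and object creation happen here, off the main thread.
            try importWork(context)
            if context.hasChanges {
                try context.save()
            }
            completion(nil)
        } catch {
            completion(error)
        }
    }
}
```

The context passed to the closure must not be used outside it, and objects created there should not be handed to the main thread directly; pass their objectID values instead.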

o15a3d4l11s2 commented 8 years ago

I am also interested in possible techniques for speeding up the persistence process. I tried resetting the context before persisting entities, but this did not affect the speed.

gonzalezreal commented 8 years ago

I think the performance problem lies in the structure of the data rather than in the amount of data. Of course, it becomes more evident with large datasets.

One thing that affects performance when serializing from JSON is object uniquing, as it requires fetching data from the database before inserting.
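To make the cost of uniquing concrete, here is a rough Core Data sketch of the general "fetch existing, insert the rest" pattern. It is not Groot's actual code: the function name existingOrNewObjects, the entity name, and the Int64 identity key are all assumptions made for illustration.

```swift
import CoreData

/// Hypothetical find-or-create helper. Issues ONE fetch for the whole
/// batch of identifiers, then inserts objects only for the missing ones.
func existingOrNewObjects(
    entityName: String,          // assumed entity name, e.g. "Item"
    identityKey: String,         // assumed Int64 attribute, e.g. "id"
    identifiers: [Int64],
    in context: NSManagedObjectContext
) throws -> [Int64: NSManagedObject] {
    let request = NSFetchRequest<NSManagedObject>(entityName: entityName)
    // A single IN predicate covers every identifier in the batch.
    let values = identifiers.map { NSNumber(value: $0) }
    request.predicate = NSPredicate(format: "%K IN %@", identityKey, values)

    var objectsByID: [Int64: NSManagedObject] = [:]
    for object in try context.fetch(request) {
        if let id = (object.value(forKey: identityKey) as? NSNumber)?.int64Value {
            objectsByID[id] = object
        }
    }

    // Anything the fetch did not return is new and gets inserted.
    for id in identifiers where objectsByID[id] == nil {
        let object = NSEntityDescription.insertNewObject(
            forEntityName: entityName, into: context)
        object.setValue(NSNumber(value: id), forKey: identityKey)
        objectsByID[id] = object
    }
    return objectsByID
}
```

The fetch is the part a plain insert avoids. If the identity attribute is not indexed in the model, that fetch is likely to get slower as the store grows.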

If you take a look at how Groot is implemented, there are three serialization strategies:

Insert
Uniquing
Composite Uniquing

As you may guess, the first one is the most performant as it does not fetch from the database. If you know that there is no duplicate data in your data set, DO NOT set identityAttributes in your entity. This will make Groot use the Insert strategy.

Groot will pick the Uniquing strategy if the identityAttributes annotation has a single attribute, otherwise it will pick the Composite Uniquing strategy.

The Uniquing strategy requires one fetch for every array of JSON objects, whereas the Composite Uniquing strategy requires one fetch for every single JSON object (it is potentially the slowest of the three strategies).
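The selection rule and fetch counts described above can be sketched as a toy model. This is plain Swift, not Groot's API: the type SerializationStrategy and both function names are made up for illustration, and the counts are only the ones stated in this comment.

```swift
/// The three strategies named above (hypothetical type, not Groot's).
enum SerializationStrategy {
    case insert
    case uniquing
    case compositeUniquing
}

/// Picks a strategy from the number of identity attributes on an entity:
/// none -> insert, exactly one -> uniquing, several -> composite uniquing.
func strategy(forIdentityAttributes attributes: [String]) -> SerializationStrategy {
    switch attributes.count {
    case 0: return .insert
    case 1: return .uniquing
    default: return .compositeUniquing
    }
}

/// Estimates the number of fetches for a payload, where `jsonArraySizes`
/// holds the length of each JSON array that gets serialized.
func estimatedFetchCount(
    for strategy: SerializationStrategy,
    jsonArraySizes: [Int]
) -> Int {
    switch strategy {
    case .insert:
        return 0                              // never fetches
    case .uniquing:
        return jsonArraySizes.count           // one fetch per array
    case .compositeUniquing:
        return jsonArraySizes.reduce(0, +)    // one fetch per object
    }
}
```

For two arrays of 100 and 30,000 objects, this model gives 0 fetches for Insert, 2 for Uniquing, and 30,100 for Composite Uniquing. If each nested relationship array is serialized as its own array, as the stack trace in the original post suggests, a payload with many small nested arrays would push the Uniquing count up as well.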

I hope this sheds some light on the subject.

gonzalezreal / Groot

Groot for Large Data Sets #52