gramps-project / gramps-web-api

A RESTful web API for Gramps
GNU Affero General Public License v3.0

GEDCOM import is too slow #436

Open hartenthaler opened 8 months ago

hartenthaler commented 8 months ago

I'm using GrampsHub (premium version). I was not able to import my GEDCOM file containing about 12,000 people and 4,500 families. There is no error message and I cannot find any logfile.

It was possible to import a smaller file, but it needed hours. The performance has to be improved.

DavidMStraub commented 8 months ago

Hi, thanks for the feedback!

I'm using GrampsHub (premium version). I was not able to import my GEDCOM file containing about 12,000 people and 4,500 families.

Currently, the Grampshub "Tribe" tier is limited to 10,000 people. This is a pretty arbitrary limit, and if you're interested in trying out a larger tree and giving me feedback, I would be happy to increase it. Let's discuss this specific issue via support@grampshub.freshdesk.com so as not to bother the other Gramps Web developers here :wink:

There is no error message and I cannot find any logfile.

When trying to import a file with more than the allowed number of people, Gramps Web API fails. In practice, it imports the GEDCOM first into an in-memory database, counts the number of people, and only continues if the number is within the limit.

However, that should lead to an error message being displayed, so we'll have to investigate why you didn't get one.
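For context, that pre-check amounts to something like the following rough Python sketch (the helper `import_to_memory_db` and the quota handling are hypothetical placeholders, not the actual Gramps Web API code):

```python
# Rough sketch of the quota pre-check described above.
# `import_to_memory_db` is a hypothetical placeholder, not a real helper.

def check_people_quota(gedcom_path: str, people_quota: int) -> None:
    """Import the GEDCOM into a throwaway in-memory database and refuse
    to continue if the number of people exceeds the plan's quota."""
    db = import_to_memory_db(gedcom_path)   # hypothetical helper
    n_people = db.get_number_of_people()    # standard Gramps DB method
    if n_people > people_quota:
        # This is the failure that should surface as an error message
        # in the frontend; per this issue, it apparently got lost.
        raise ValueError(
            f"GEDCOM contains {n_people} people, but the current plan "
            f"allows only {people_quota}."
        )
```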

It was possible to import a smaller file, but it needed hours. The performance has to be improved.

Interesting. That's definitely not good. I suspect it's the creation of the search index that takes so long, not the actual import. If you are willing to help us track down the bottleneck, there are two things you could try:

1. Click "Refresh Search Index" in the admin settings and let us know how long this took (seconds, minutes, hours?)
2. Use Gramps Desktop and import the Gedcom into a new database. How long does this take?

emyoulation commented 8 months ago

David,

On Discourse, he noted that importing his particular GEDCOM dialect generated thousands of Notes with data about specific incompatible tags. (Probably lots of UUIDs and other custom data.)

All GEDCOM imports are probably going to have a certain number of these. In the future, some will transfer directly into Gramps records. (For example, in 5.2 the UUID has a designated spot.) So it does not make sense to delete the data.

But this data is typically not appropriate for non-Admin users or reports.

Perhaps you could tweak the GEDCOM import plugin for Gramps Web so that the Notes generated for errors are flagged as "Private". This would hide those Notes from non-admin users and Reports.
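A rough sketch of what that could look like with the Gramps Python API, as a post-import pass (the text marker used to recognize the importer's error notes is an assumption and would need to be checked against what the GEDCOM importer actually writes):

```python
# Hedged sketch: flag import-generated error notes as private so they are
# hidden from non-admin users and reports. The `marker` string is an
# assumption; check what the GEDCOM importer really writes into these notes.

from gramps.gen.db import DbTxn

def privatize_import_error_notes(db, marker="Records not imported"):
    with DbTxn("Privatize GEDCOM import error notes", db) as trans:
        for note in db.iter_notes():
            if marker in note.get() and not note.get_privacy():
                note.set_privacy(True)
                db.commit_note(note, trans)
```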

hartenthaler commented 8 months ago

However, that should lead to an error message being displayed, so we'll have to investigate why you didn't get one.

Maybe my fault. I reopened the window several hours after starting the import process, so maybe I overlooked the error message.

Click "Refresh Search Index" in the admin settings and let us know how long this took (seconds, minutes, hours?)

more than 4 hours

Use Gramps Desktop and import the Gedcom into a new database. How long does this take?

about 3 minutes (not fast, but acceptable)

This import in the Gramps desktop program generated 74,000 error message notes, mainly caused by a custom tag in CHAN indicating the last user who changed that record; CHAN is part of nearly all records in my GEDCOM file. There are some _UID and many _LOC records, too.

btw: does Gramps Web store the user in the CHAN record using a custom GEDCOM tag?

DavidMStraub commented 8 months ago

Thanks, that's really interesting. Could you do me a favour and click on "recreate search index" in the admin settings? I can then check in the logs how long that took. Since your desktop import is much faster, I suspect it's the search indexing that takes so long (Gramps Web first imports the file and then indexes the database to enable full text search).

hartenthaler commented 8 months ago

I started the rebuild of the search index just now.

DavidMStraub commented 8 months ago

Thanks! Should be this one:

[2023-11-02 09:30:05,630: INFO/ForkPoolWorker-2] Task gramps_webapi.api.tasks.search_reindex_full succeeded in 1603.6000406891108s: None

So it took half an hour.

Your import might have been this one:

[2023-10-30 06:49:40,236: INFO/ForkPoolWorker-2] Task gramps_webapi.api.tasks.import_file succeeded in 8662.508026679978s: None

So about 2.5 hours (including the indexing).

So it looks like the import (without the indexing) already took 2 hours, which is quite disturbing, but a good hint to start debugging. I suspect it could be due to inefficient database calls on web vs. desktop like in #433.
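As a generic illustration of that kind of bottleneck (not a diagnosis of the actual import code), committing every object in its own transaction multiplies the per-call overhead, whereas wrapping the whole batch in one `DbTxn` amortizes it:

```python
# Illustration only, not taken from the Gramps Web API or importer code:
# per-object transactions vs. one batched transaction.

from gramps.gen.db import DbTxn

def add_people_slow(db, people):
    # One transaction per person: the per-call overhead (and, with a
    # client-server database, the network round trips) adds up quickly.
    for person in people:
        with DbTxn("Add person", db) as trans:
            db.add_person(person, trans)

def add_people_batched(db, people):
    # A single transaction for the whole batch amortizes that overhead.
    with DbTxn("Add people", db) as trans:
        for person in people:
            db.add_person(person, trans)
```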

hartenthaler commented 8 months ago

Does the time needed for reindexing depend on whether the index is already up to date? I did a reindexing yesterday and have made no changes since then, so maybe the initial indexing needs more time than a reindexing without any changes. But you know how it is implemented.

About database performance: desktop Gramps uses SQLite and Gramps Web uses MySQL, correct? Do you believe that the performance of SQLite on my laptop is so much better than that of MySQL on your powerful server?

Webtrees has optimized the performance of its GEDCOM import in several steps over the last few years. It is written in PHP, but if you are interested in how it works in webtrees, I can point you to the relevant file.

hartenthaler commented 8 months ago

... thinking again about what you said in the referenced issue: you are right, maybe the number of database calls is relevant.

DavidMStraub commented 8 months ago

Does the time needed for reindexing depend on whether the index is already up to date? I did a reindexing yesterday and have made no changes since then, so maybe the initial indexing needs more time than a reindexing without any changes. But you know how it is implemented.

The button POSTs to the /api/search/index/?full=1 endpoint, so this recreates the whole index. An incremental reindex is done every time something is changed. The point of the button is just to have a fallback in case something is corrupted for some reason (which shouldn't happen), or after an upgrade in case the search index format changed. So the time needed should be representative of the first indexing.
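For reference, the same full reindex can also be triggered directly against that endpoint, e.g. with Python requests (the base URL and the way the access token is obtained are placeholders; the reindex itself runs as a background task, so the response returns quickly):

```python
# Minimal sketch: trigger a full search reindex via the Gramps Web API.
# BASE_URL and TOKEN are placeholders; TOKEN must be an access token for a
# sufficiently privileged user.

import requests

BASE_URL = "https://example.grampshub.com"
TOKEN = "..."

resp = requests.post(
    f"{BASE_URL}/api/search/index/?full=1",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.status_code)  # the indexing itself continues in the background
```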

... thinking again about what you said in the referenced issue: you are right, maybe the number of database calls is relevant.

Indeed, that's the only possible difference. Gramps Web can use SQLite or PostgreSQL; Grampshub uses PostgreSQL, since that is the only way to go for a larger number of users. The client-server architecture (rather than a local file) adds a very tiny network delay to each query, which normally doesn't matter, unless a function loops over tens of thousands of database calls.
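As a back-of-envelope illustration (the numbers are assumptions, not measurements), even a millisecond-scale per-query round trip becomes noticeable at that scale:

```python
# Illustrative arithmetic only: how a tiny per-query delay scales with the
# number of individual database calls. Both numbers are assumptions.

n_calls = 100_000     # e.g. several queries per record for a tree of this size
per_call_s = 0.002    # assumed 2 ms client-server round trip

print(f"~{n_calls * per_call_s / 60:.0f} minutes of pure latency")
# Each additional query issued per object multiplies this overhead.
```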