mrkrstphr / php-gedcom

UNMAINTAINED - A library for reading and writing GEDCOM files in PHP.
MIT License

Dealing with low memory and/or large files #10

Open stuporglue opened 11 years ago

stuporglue commented 11 years ago

Kristopher,

How big of a file should php-gedcom be able to handle? It loads the whole GEDCOM into memory, doesn't it?

I've got a GEDCOM with about 19,000 people (9.4 MB), and php-gedcom runs out of memory. PHP's memory_limit was 128M; once I raised it to 256M the file parsed successfully, but it took about 18 seconds.
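(For anyone hitting the same wall: the workaround above can be done per-script rather than in php.ini.)

```php
// Raise PHP's memory limit for this script only; php.ini is untouched.
ini_set('memory_limit', '256M');
```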

Would it be possible to implement a less memory-intensive way of parsing the file?

I think in a lot of cases the entire GEDCOM isn't needed anyway, so if there were a way to avoid loading every object into memory during initialization, it could speed up access to the parts of the file the user actually wants.

You know your code way better than I do, of course, so you probably have better ideas, but here are the two initial ideas that came to mind:

1) Don't actually parse the file until Gedcom.php's getIndi (or getFam, etc.) is called. At that point the whole file would be read through, but only Indi objects would be created and returned (sketched below).

Upside: Only the objects the user requests are created

Downside: Reading through the file multiple times if multiple functions are called (e.g. getFam is called after getIndi).
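A rough sketch of what option 1 could look like (the LazyGedcom class and its parsing are hypothetical, not the library's actual code):

```php
<?php
// Hypothetical sketch of option 1 -- not php-gedcom's actual code.
// Parsing is deferred until getIndi() is first called, and only
// individual (INDI) records are materialized.
class LazyGedcom
{
    private $path;
    private $indi;

    public function __construct($path)
    {
        $this->path = $path; // nothing is read yet
    }

    public function getIndi()
    {
        if ($this->indi === null) {
            $this->indi = array();
            $handle = fopen($this->path, 'r');
            $current = null;
            while (($line = fgets($handle)) !== false) {
                // A level-0 line like "0 @I123@ INDI" starts a new individual.
                if (preg_match('/^0 @(\w+)@ INDI/', $line, $m)) {
                    $current = $m[1];
                    $this->indi[$current] = array();
                } elseif ($line[0] === '0') {
                    $current = null; // some other record type begins
                } elseif ($current !== null) {
                    $this->indi[$current][] = rtrim($line); // raw sub-lines
                }
            }
            fclose($handle);
        }
        return $this->indi;
    }
}
```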

2) Read through the file once initially and build an array of line numbers or file positions where each object type occurs, but don't initialize any objects until they are requested. When getIndi (or whatever) is called, that array of positions is used to seek to the object (also sketched below).

Upside: Only the objects the user requests are created, slightly less time reading through the file
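A rough sketch of option 2, building a byte-offset index with ftell() and seeking on demand (again, GedcomIndex is a hypothetical class, not the library's API):

```php
<?php
// Hypothetical sketch of option 2 -- not php-gedcom's actual code.
// One cheap pass records the byte offset of every level-0 record;
// nothing is parsed until a record is actually requested.
class GedcomIndex
{
    private $handle;
    private $offsets = array(); // record type => id => byte offset

    public function __construct($path)
    {
        $this->handle = fopen($path, 'r');
        while (true) {
            $pos  = ftell($this->handle);
            $line = fgets($this->handle);
            if ($line === false) {
                break;
            }
            // Matches level-0 headers like "0 @I123@ INDI" or "0 @F45@ FAM".
            if (preg_match('/^0 @(\w+)@ (\w+)/', $line, $m)) {
                $this->offsets[$m[2]][$m[1]] = $pos;
            }
        }
    }

    // Seek to one record and return its raw lines; building a full
    // Indi/Fam object from them could happen here instead.
    public function getRecord($type, $id)
    {
        if (!isset($this->offsets[$type][$id])) {
            return null;
        }
        fseek($this->handle, $this->offsets[$type][$id]);
        $lines = array(rtrim(fgets($this->handle)));
        while (($line = fgets($this->handle)) !== false && $line[0] !== '0') {
            $lines[] = rtrim($line);
        }
        return $lines;
    }
}
```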

In either of those cases php-gedcom would still create 19,000 Indi objects once getIndi is called, even if I only need one of them. Implementing either of the above options and also implementing something like getOneIndi($indiId) (with corresponding methods for families, media, notes, etc.) could greatly reduce memory usage for some common use cases.
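With an offset index like the hypothetical GedcomIndex sketched above, getOneIndi would reduce to a single lookup plus one record's worth of parsing:

```php
// Hypothetical usage; 'I42' is a made-up record ID.
$index  = new GedcomIndex('19436_individuals.ged');
$record = $index->getRecord('INDI', 'I42'); // no other objects are built
```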

Let me know any thoughts you might have on the matter. I'd be happy to put something like this on my TODO list, even if I won't get to it till the end of summer.

Thanks, Michael Moore / stuporglue

mrkrstphr commented 11 years ago

Michael, the largest GEDCOM file I have to work with is only 10,000 individuals, and php-gedcom seems to handle it fine. Is there any way you could privatize your file and send it over to me as a testing example? Otherwise I could probably write a script to generate a large file or something.

stuporglue commented 11 years ago

Here you go:

http://stuporglue.org/familyhistory/download/14915_individuals.ged
http://stuporglue.org/familyhistory/download/19436_individuals.ged

The test script I'm using: http://stuporglue.org/familyhistory/download/test.txt
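(For reference, a script producing output in the shape below could look roughly like this; the actual test.txt may differ. Parser, getIndi(), and getFam() are the calls already discussed in this thread.)

```php
<?php
// Sketch of a parse benchmark: count records, report peak memory and time.
require_once 'vendor/autoload.php'; // or the library's own autoloader

$file = $argv[1];
echo "------------- Testing PhpGedcom\\Parser on $file -------------\n";

$start  = microtime(true);
$parser = new \PhpGedcom\Parser();
$gedcom = $parser->parse($file);

echo 'Found ' . count($gedcom->getIndi()) . " individuals\n";
echo 'Found ' . count($gedcom->getFam()) . " families\n";
echo 'Used ' . round(memory_get_peak_usage(true) / 1048576) . "MB memory. ";
echo 'Took ' . round(microtime(true) - $start) . " seconds\n";
```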

```
------------- Testing PhpGedcom\Parser on 14915_individuals.ged -------------
Found 14915 individuals
Found 10568 families
Used 146MB memory. Took 18 seconds
```

```
------------- Testing PhpGedcom\Parser on 19436_individuals.ged -------------
Found 19436 individuals
Found 5133 families
Used 215MB memory. Took 22 seconds
```

For pages that show all the info about a single person, parsing the entire file adds a lot of overhead.