neRok00 / ancestry-image-downloader

A Python script that downloads images from Ancestry.com that are related to records in your family tree.

Ancestry Germany #1

Closed m-weinand closed 7 years ago

m-weinand commented 7 years ago

I totally understand the last line of your FAQ, "The script outputs helpful errors, which you can review and try to solve yourself.", and that you don't want to provide support. Unfortunately, I don't understand how to modify this script for Ancestry Germany (no, it's not just a matter of replacing all instances of 'ancestry.com' with 'ancestry.de'), nor do I understand the error message the script prints when it aborts before downloading any image:

    Processing APID 1 of 28 <APID 1,60504::1117084>...
    Getting the record page for the APID...
    Processing the record page to determine the image ID...
    The record does not have an image.
    Writing results to CSV file...
    Traceback (most recent call last):
      File "D:\ancestry_image_downloader.py", line 412, in <module>
        run(gedcom=GEDCOM_FILE, username=USERNAME, password=PASSWORD, output_directory=OUTPUT_DIRECTORY)
      File "D:\ancestry_image_downloader.py", line 396, in run
        problem_apids = process_apids(apid_matches, session=session, csv_writer=csv_writer, logger=logger)
      File "D:\ancestry_image_downloader.py", line 249, in process_apids
        processed_iids[iid].apids.append(apid)
    UnboundLocalError: local variable 'iid' referenced before assignment

If there is anyone out there who can push me in the right direction...

neRok00 commented 7 years ago

All good. In this instance, you have actually found a bug in the script, so I will look into it and push an update for that soon.
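For readers unfamiliar with this error: an `UnboundLocalError` like the one in the traceback typically arises when a variable is assigned only inside a branch that did not run. The sketch below is a hypothetical reconstruction, not the script's actual code. `process_apid_buggy` and `process_apid_fixed` are made-up names, and only the `var iid=` pattern is taken from this thread.

```python
import re

# Pattern quoted elsewhere in this thread; everything else here is hypothetical.
IID_REGEX = re.compile(r"var iid='([\w\d_-]+)';")


def process_apid_buggy(page_html, processed):
    match = IID_REGEX.search(page_html)
    if match:
        iid = match.group(1)  # only assigned when the page has an image ID
    # Bug: when there was no match, 'iid' was never assigned.
    processed.setdefault(iid, []).append("apid")


def process_apid_fixed(page_html, processed):
    match = IID_REGEX.search(page_html)
    if match is None:
        return None  # record has no image: skip the bookkeeping entirely
    iid = match.group(1)
    processed.setdefault(iid, []).append("apid")
    return iid


if __name__ == "__main__":
    try:
        process_apid_buggy("<html>no image here</html>", {})
    except UnboundLocalError as exc:
        print("buggy version raised:", exc)
    print("fixed version returned:", process_apid_fixed("<html>no image here</html>", {}))
```

Returning early when no image ID is found is one way to avoid the error. Whether the real fix takes this shape is not shown in this thread.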

Regarding the German Ancestry, I don't think you have to change anything. It seems Ancestry uses common databases and record numbers for everything; they just limit what you can access with certain subscriptions. So theoretically, your Germany-specific records should still be accessible and downloadable via Ancestry.com. Indeed, this was the case when I tested with an Australian (Ancestry.com.au) account. It might be different though, so perhaps you can report back after I fix this bug.

m-weinand commented 7 years ago

Great, thank you. I will report back as soon as the script has been updated.

neRok00 commented 7 years ago

It has been updated in a45bee4a8135abe295b30883ea27a2a74049f3c3. I haven't been able to test it myself, but it looks good.

PS - Make sure you download the script again.

m-weinand commented 7 years ago

No more error message(s), but also no downloaded images, even though the APID refers to the correct collection (http://search.ancestry.com/cgi-bin/sse.dll?indiv=1&dbid=60504&h=1117084):

    Attempting to login to Ancestry.com...
    Login successful.
    Creating output folder and files...
    Output files and folders created.
    Begin processing the APID's and images...
    Processing APID 1 of 28 <APID 1,60504::1117084>...
    Getting the record page for the APID...
    Processing the record page to determine the image ID...
    The record does not have an image.
    Writing results to CSV file...
    Finished!
    Processing APID 2 of 28 <APID 1,60504::1117084>...
    APID previously processed as part of another source.
    Finished!
    All APID's processed.
    There were errors with 0 APIDs.
    Closing files...
    Finished!
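The APID in the log and the record URL quoted in this comment appear to carry the same two numbers: `1,60504::1117084` corresponds to `dbid=60504` and `h=1117084`. A minimal sketch of that mapping follows, assuming the APID always has the form `<prefix>,<dbid>::<record id>` (inferred from this single example). `parse_apid` and `record_url` are hypothetical helpers, not functions from the script.

```python
import re
from urllib.parse import urlencode

# Assumed APID layout, inferred only from the single example in this thread.
APID_REGEX = re.compile(r"^(\d+),(\d+)::(\d+)$")


def parse_apid(apid):
    """Split an APID such as '1,60504::1117084' into (dbid, record_id)."""
    match = APID_REGEX.match(apid.strip())
    if match is None:
        raise ValueError(f"unrecognised APID: {apid!r}")
    _prefix, dbid, record_id = match.groups()
    return dbid, record_id


def record_url(apid, host="search.ancestry.com"):
    """Build a record page URL in the form quoted in this thread."""
    dbid, record_id = parse_apid(apid)
    query = urlencode({"indiv": 1, "dbid": dbid, "h": record_id})
    return f"http://{host}/cgi-bin/sse.dll?{query}"


if __name__ == "__main__":
    print(record_url("1,60504::1117084"))
    print(record_url("1,60504::1117084", host="search.ancestry.de"))
```

The `host` parameter is there only to show that the same record numbers can be tried against another regional site.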

neRok00 commented 7 years ago

Yeah, now we are moving into an area that's going to be hard to investigate, because I can't access the record page for that German record at all (admittedly, I don't have an active subscription at the moment). I presume all 28 APIDs produce the same messages repeated, with no other variants?

If you open ancestry.com in whatever browser you usually use, log in, and then visit the URL you pasted above, is there an image on that page? If so, can you paste the HTML of that page here (either save the page and upload the HTML file, or right click > view page source and copy the contents here)? Then I can attempt to determine why it hasn't found the image.

m-weinand commented 7 years ago

The process terminates after the above-mentioned 2 APIDs, which are identical because they contain 2 dates for one person in one image. When I manipulate the gedcom file (just deleting the one person with APID 1), I get an error message saying that the gedcom file cannot be verified as a gedcom file.

[Btw: all APIDs in my gedcom are number '1']...

When I enter the above URL with .de instead of .com I get the following page: [screenshot 01]

When I click the green button named 'Anzeigen' (German for 'show (image)') the picture opens in the viewer and it looks like this: [screenshot 02]

The URL of this second page is http://interactive.ancestry.de/60504/42683_332%5E5%5E%5E3175-00181?pid=1117084&backurl=http%3a%2f%2fsearch.Ancestry.de%2f%2fcgi-bin%2fsse.dll%3findiv%3d1%26dbid%3d60504%26h%3d1117084&treeid=&personid=&hintid=&usePUB=true

neRok00 commented 7 years ago

Okay, the problem seems to be special characters in the URL (notice that the URL you pasted has %5E, whilst the screenshot shows the ^ character in those positions).

If you could test for me, edit the script and change line number 194 from iid_regex = re.compile(r"var iid='([\w\d_-]+)';") to iid_regex = re.compile(r"var iid='(\S+)';"), and report back. This should let the script read the URL, but there might be another problem. We shall find out...

m-weinand commented 7 years ago

No changes, same log messages copied from shell window:

    Agree
    Attempting to login to Ancestry.com...
    Login successful.
    Creating output folder and files...
    Output files and folders created.
    Begin processing the APID's and images...
    Processing APID 1 of 28 <APID 1,60504::1117084>...
    Getting the record page for the APID...
    Processing the record page to determine the image ID...
    The record does not have an image.
    Writing results to CSV file...
    Finished!
    Processing APID 2 of 28 <APID 1,60504::1117084>...
    APID previously processed as part of another source.
    Finished!
    All APID's processed.
    There were errors with 0 APIDs.
    Closing files...
    Finished!

Please support this script creators efforts by donating via Paypal at the following link; http://http://neRok00.github.io/ancestry-image-downloader

m-weinand commented 7 years ago

I'm off for now; I'll get back to this later. Thanks.

m-weinand commented 7 years ago

Maybe we should stop here. Thanks for your help! I can easily download those 28 images one by one, no need for a script. I thought I would be able to dive into the code and modify the script so I could download whole folders like a volume of a church book. But modifying the gedcom file in any way, that is: even the slightest changes in the gedcom file result in an error message from the shell, otherwise it should be easy to just fill the gedcom file with APIDs: "Validating gedcom file... The following problem was encountered when validating the file; The file cannot be verified as a gedcom file, as it does not have a header section. Aborting."

Thanks again for your patience :)

neRok00 commented 7 years ago

> so I could download whole folders like a volume of a church book

I wouldn't do that. Ancestry and the content owners let you download the images that this script would download, but using the script is against the Ancestry service T&C's. However, if you start downloading whole volumes of data unrelated to your tree, not only is that not allowed by the Ancestry service, it is not allowed by the content owners (and we are potentially talking government departments here). You would potentially be opening yourself up to copyright infringements and all sorts of things.

> But modifying the gedcom file in any way, that is: even the slightest changes in the gedcom file result in an error message from the shell

The script just uses a simple regex search to find the HEAD section of the gedcom. If it fails after you remove or change some other line, chances are the editor you are using is changing the file in some other way, such as converting its character encoding or line endings, and the regex search is failing because of that.
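To illustrate how an editor can break such a check without touching the `HEAD` line itself, here is a minimal sketch. It is not the script's validation code: `STRICT_HEAD` and `has_gedcom_header` are hypothetical, and the only assumption about GEDCOM is that the file starts with a `0 HEAD` line.

```python
import re

# Hypothetical strict check: '0 HEAD' must be the very first bytes, LF-terminated.
STRICT_HEAD = re.compile(rb"\A0 HEAD\n")

UTF8_BOM = b"\xef\xbb\xbf"


def has_gedcom_header(data):
    """Tolerant check: ignore a UTF-8 byte order mark and accept CRLF or LF."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return re.match(rb"0 HEAD\r?\n", data) is not None


if __name__ == "__main__":
    original = b"0 HEAD\n1 CHAR UTF-8\n0 TRLR\n"
    # What some editors write back: BOM prepended, line endings converted to CRLF.
    resaved = UTF8_BOM + original.replace(b"\n", b"\r\n")
    for label, data in (("original", original), ("resaved", resaved)):
        print(label, bool(STRICT_HEAD.match(data)), has_gedcom_header(data))
```

The tolerant version still does not handle a file re-saved as UTF-16, so saving with the original encoding and line endings remains the simpler remedy.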

> otherwise it should be easy to just fill the gedcom file with APIDs

If you were going to download entire volumes, that would be a terrible way to go about it anyway. A single page from something like an electoral roll could have 100 people on it, and that means 100 records/APIDs. The script would spend most of its time redundantly checking every APID.
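The cost can be sketched with a toy model. It assumes, hypothetically, that every APID needs its own record-page lookup before the script knows which image it belongs to. `group_apids_by_image`, the `lookup_iid` callback, and the database ID `99999` are made up for illustration.

```python
from collections import defaultdict


def group_apids_by_image(apids, lookup_iid):
    """Return (images, lookups): image ID -> APIDs, and how many lookups ran."""
    images = defaultdict(list)
    lookups = 0
    for apid in apids:
        lookups += 1  # one record-page request per APID
        iid = lookup_iid(apid)
        if iid is not None:
            images[iid].append(apid)
    return dict(images), lookups


if __name__ == "__main__":
    # Hypothetical volume page: 100 records that all point at one image.
    apids = [f"1,99999::{n}" for n in range(100)]
    images, lookups = group_apids_by_image(apids, lambda apid: "page-0001")
    print(f"{lookups} lookups for {len(images)} image(s)")
```

In this model, 100 lookups yield a single image, which is the redundancy described above.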

> Maybe we should stop here.

I still want to get to the bottom of why it isn't downloading the images for the 28 records. Like I said previously, if you could go back to the page in your first screenshot, right click in a blank area and click 'view page source' or 'view source' (depending on the program you are using, the translation, etc.), then copy and paste the source here, I can suss it out. In particular, I am looking for a line in the code along the lines of var iid='42683_332%5E5%5E%5E3175-00181';. That is the line the script searches for to determine the image ID. It must contain some characters that the search pattern isn't picking up.
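As a side note on the `%5E` sequences: they are the percent-encoded form of `^`, so the ID in the viewer URL and an ID written with `^` are the same string in two spellings. A quick check with Python's standard `urllib.parse` (the variable names are only illustrative):

```python
from urllib.parse import quote, unquote

encoded_iid = "42683_332%5E5%5E%5E3175-00181"  # as it appears in the viewer URL
decoded_iid = unquote(encoded_iid)

print(decoded_iid)  # 42683_332^5^^3175-00181
print(quote(decoded_iid) == encoded_iid)  # re-encoding gives the URL form back
```

Which of the two spellings appears in the page source determines what the search pattern has to accept.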

m-weinand commented 7 years ago

Alright, so here's the HTML code of the page from the first screenshot. Its URL is http://search.ancestry.de/cgi-bin/sse.dll?indiv=1&dbid=60504&h=1117084

<!DOCTYPE html>

Hamburg, Deutschland, Heiratsregister, 1874-1920 - Ancestry.de

Record image: Anzeigen (Show)

    Name: Johann Albert Weinand
    Gender: male
    Age: 28
    Birth date: 23 Jan 1883
    Marriage date: 21 Feb 1911
    Marriage place: Hamburg, Hamburg, Deutschland
    Civil registry office: Hamburg 02a
    Spouse: Frieda Bertha Ernstine Larsen
    Certificate number: 88
    Reference number: 332-5_3175

Source information: Ancestry.com. Hamburg, Deutschland, Heiratsregister, 1874-1920 [database on-line]. Provo, UT, USA: Ancestry.com Operations, Inc., 2015.

Original data: Best. 332-5 Standesämter, Personenstandsregister, Sterberegister, 1876-1950, Staatsarchiv Hamburg, Hamburg, Deutschland.

Description: This collection contains marriage registers from Hamburg and covers the years 1874 through 1920.

neRok00 commented 7 years ago

Wow, that didn't work well! You need to indent code like that with 4 spaces for it to render correctly.

Regardless, I determined from that HTML that the IID is written as var iid='42683_332^5^^3175-00181';, which means the updated regex I suggested in a previous post should have worked. Indeed, my own 'non-live' tests show it should have worked.

I have updated the script again (see 8778ad953855c3f43a275ca9cdb2a37fa4207cdf) with a regex that differs slightly from any of the above, and also put in a better log message for when the image isn't found. You might like to try this new script (I have now added a version number to the top of the script for tracking), but if it didn't work before then it probably won't work now. Unfortunately I won't be able to help with this issue any further, because fault-finding is too difficult without an active subscription, particularly a German/world subscription.

m-weinand commented 7 years ago

Sorry for the mess with the code, I didn't know how to handle this (I'm totally new to GitHub). Anyway, the new script worked partially: it downloaded at least the first image and quit after the second because it noticed that it had already processed the same APID.

I can totally understand that you don't want to look further into this, but here's the log from the shell:

    Begin processing the APID's and images...
    Processing APID 1 of 28 <APID 1,60504::1117084>...
    Getting the record page for the APID...
    Processing the record page to determine the image ID...
    Get information regarding the image...
    Processing the image information...
    Downloading image...
    Saving image...
    Image file saved successfully.
    Writing results to CSV file...
    Finished!
    Processing APID 2 of 28 <APID 1,60504::1117084>...
    APID previously processed as part of another source.
    Finished!
    All APID's processed.
    There were errors with 0 APIDs.
    Closing files...
    Finished!

neRok00 commented 7 years ago

All good, it was the website side I couldn't investigate further.

This latest issue though (only processing 2) has me stumped. Nothing in the script looks out of place. Are you willing to upload the gedcom here, or perhaps email it to me, so I can test it and see exactly what is happening?

Scrap that, found the problem, new script incoming.

neRok00 commented 7 years ago

Fix made 7560b5d2a233895000201225989e38c4082086d8. Hopefully it all works now. Please download again and report back.

m-weinand commented 7 years ago

Running perfectly now! Thanks a lot.