raleighlittles / SMS-backup-and-restore-extractor

A simple Python script for extracting images out of an "SMS Backup & Restore" backup.
https://play.google.com/store/apps/details?id=com.riteshsahu.SMSBackupRestore&hl=en_US

Script only extracts a small portion of images from large (3 GB) file #4

Closed malexherron closed 1 month ago

malexherron commented 1 month ago

I am using this with a 3.7 GB SMS Backup & Restore file and it only extracts 32 images. There should be hundreds, as verified by loading the file into https://www.synctech.com.au/sms-backup-restore/view-backup/

I am using Ubuntu 20.04 with Python 3.8.10 and LXML 5.2.2. Here are the commands used and their output:

08:11 [alex@kite] ~/sms ┤ pip3 show lxml | grep Version
Version: 5.2.2

09:43 [alex@kite] ~/sms ┤ python3 --version
Python 3.8.10

09:44 [alex@kite] ~/sms ┤ ls
backupExtractor  output  sms-20240709093820.xml

10:31 [alex@kite] ~/sms ┤ python3 backupExtractor/backup_extractor.py -i /home/alex/sms -t sms -o /home/alex/sms/output
32 files created... Automatically removing duplicates
0 files removed

any insight as to how to troubleshoot this would be appreciated. thanks for the wonderful tool!

raleighlittles commented 1 month ago

Huh, that is weird. Can you run these commands:

$ grep -inr ct=\"image/png\" sms*.xml | wc -l
$ grep -inr ct=\"image/jpeg\" sms*.xml | wc -l 

The script basically "finds" images by looking for those blocks in the XML, so a match means an image is included in the message that follows:

[screenshot: a matching block in the XML, with "data=" followed by the base-64 encoded image]

I want to check if it's first finding all of the images that are supposed to be there. @malexherron
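
A minimal sketch of the lookup described above, assuming the part layout shown in this thread (ct, cl and data attributes on a part element). It uses the standard-library ElementTree rather than the lxml XPath query the script relies on, and `extract_image_parts` is a hypothetical name, not a function from the repository.

```python
import base64
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_image_parts(root, out_dir):
    """Decode every <part> whose ct attribute is an image MIME type.

    Hypothetical helper for illustration; returns the list of paths written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, part in enumerate(root.iter("part")):
        mime = part.get("ct", "")
        data = part.get("data")
        # Skip text parts and image parts that carry no payload.
        if not mime.startswith("image/") or not data:
            continue
        # "cl" holds the original file name; the backup stores the literal
        # string "null" when it is missing, so fall back to a generated name.
        name = part.get("cl")
        if not name or name == "null":
            name = f"part_{index}.{mime.split('/')[-1]}"
        # Prefix with the part index so repeated names do not collide.
        target = out_dir / f"{index:06d}_{Path(name).name}"
        target.write_bytes(base64.b64decode(data))
        written.append(target)
    return written
```

For a small file, `extract_image_parts(ET.parse("sms.xml").getroot(), "output")` would be enough; a multi-gigabyte backup needs a streaming or huge-tree-aware parser instead of loading everything at once.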


Also, thank you for telling me about that tool, I had no idea that existed -- that would've been very useful to know about when I was working on this, and I will add that to the README

malexherron commented 1 month ago

happy to help @raleighlittles, and thanks for the response!

looks like grep finds many instances of image/png and /jpeg.

19:11 [alex@kite] ~/sms ┤ grep -inr ct=\"image/png\" sms*.xml | wc -l
453

19:11 [alex@kite] ~/sms ┤ grep -inr ct=\"image/jpeg\" sms*.xml | wc -l
3588

raleighlittles commented 1 month ago

Okay, so it finds the images. Is there a data field in there, then? Like in the screenshot I showed earlier, there is a "data=" and then the base-64 encoded image follows. Can you do:

$ grep -inr ct=\"image/jpeg\" sms*.xml | head -n1 | less 

and then take a screenshot of the first two lines?

[screenshot]

I want to see that the data and "cl" fields are there. @malexherron

malexherron commented 1 month ago

yeah looks like those fields are present

[screenshot: zoc_2024 07 14-12 30 53_3422x56]

123887:      <part seq="0" ct="image/jpeg" name="null" chset="null" cd="null" fn="null" cid="&lt;image000000_6599.jpg&gt;" cl="image000000_6599.jpg" ctt_s="null" ctt_t="null" text="null" sub_id="-1" data="/9j/4AAUSkZJRgABAQEBLAEsAABBTVBG/+EJzEV4aWYAAE1NACoAAAAIAA4BDwACAAAABgAAALYBEAACAAAADgAAALwBEgADAAAAAQABAAABGgAFAAAAAQAAAMoBGwAFAAAAAQAAANIBKAADAAAAAQACAAABMQACAAAABQAAANoBMgACAAAAFAAAAOABPAACAAAADgAAAPQBQgAEAAAAAQAAAgABQwAEAAAAAQAAAgACEwADAAAAAQABAAC

I also ran the following grep to look for image/jpeg parts with non-empty cl and data fields. It matched the image/jpeg count above, so I don't think the issue is empty data.

12:19 [alex@kite] ~/sms ┤ grep -E 'ct="image/jpeg".*cl="[^"]+".*data="[^"]+"' sms*.xml | wc -l
3588
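
The same cross-check can be made with a parser instead of grep, which helps separate "the data is missing" from "the parser stopped early". Below is a sketch using the standard library's streaming `iterparse`; `count_image_parts` is a hypothetical helper, and it has not been timed against a multi-gigabyte backup.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_image_parts(xml_path):
    """Stream the backup and count image <part> elements per MIME type.

    Only parts with non-empty "cl" and "data" attributes are counted,
    mirroring the grep -E pattern above.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "part":
            mime = elem.get("ct", "")
            if mime.startswith("image/") and elem.get("cl") and elem.get("data"):
                counts[mime] += 1
        # Release attributes and children as soon as an element closes,
        # so the large base64 strings are not kept in memory.
        elem.clear()
    return counts
```

If these counts agree with grep while the extractor writes far fewer files, the XML itself is fine and the loss happens when the extractor parses it.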

raleighlittles commented 1 month ago

Hmm, I'm puzzled now. I can't imagine why the xpath result doesn't match the actual number that's in the file.

Our files have the same schema -- a part element, a ct (mime type), a data section.

Can you try getting rid of "recover=True" on this line https://github.com/raleighlittles/SMS-backup-and-restore-extractor/blob/a0d940a7aaac7add3c090b8341285b5eb2a162b0/mms_images_extractor.py#L16

and replace it with "huge_tree=True"

The only thing I can think of is that maybe LXML isn't parsing the entire tree due to the large file size or something like that. @malexherron
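
For reference, a sketch of what the suggested swap looks like in lxml. The lxml documentation describes `recover` as trying hard to parse through broken XML, and `huge_tree` as disabling security restrictions so that very deep trees and very long text content are supported. One plausible reading, not confirmed in this thread, is that an oversized base64 attribute tripped a parser limit and `recover=True` let parsing carry on with part of the document dropped instead of raising an error. `load_backup` and `image_parts` are hypothetical names, not the repository's functions.

```python
def load_backup(xml_path, huge_tree=True):
    """Parse a backup with lxml, lifting libxml2's size limits by default."""
    # lxml is third-party (pip install lxml); it is imported here so the
    # sketch can be loaded on a machine that does not have it installed.
    from lxml import etree

    # No recover=True: a parse problem should raise, not be skipped silently.
    parser = etree.XMLParser(huge_tree=huge_tree)
    return etree.parse(str(xml_path), parser)


def image_parts(tree):
    """Return every <part> element whose ct attribute is an image MIME type."""
    return tree.xpath('//part[starts-with(@ct, "image/")]')
```

Usage would be along the lines of `len(image_parts(load_backup("sms-20240709093820.xml")))`, which can then be compared with the grep counts. Since `huge_tree` removes a safety limit, it is best reserved for files from a trusted source, such as your own backup.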

malexherron commented 1 month ago

@raleighlittles that change had the desired effect of outputting over 4000 images. Thanks so much for the help, even with the troubleshooting this tool still saved me loads of time lol!

raleighlittles commented 1 month ago

@malexherron Glad that fixed it! I really wonder why LXML has that option in the first place, I can't imagine this is the first time it's bitten someone.