jjjake / internetarchive

A Python and Command-Line Interface to Archive.org
GNU Affero General Public License v3.0

Way to skip PDFs that cause `Syntax error` #651

Open wcedmisten opened 5 months ago

wcedmisten commented 5 months ago

Hello!

Not sure if there is any workaround for this currently, but I'm trying to bulk upload a set of ~70,000 PDFs using `ia upload`. The problem is that I periodically get the error:

Uploaded content is unacceptable. - Syntax error detected in pdf data. You may be able to repair the pdf file with a repair tool, pdftk is one such tool.

The command then stops with an error, and I have to manually delete the PDF and run the command again to resume uploading. Is there a way to automatically skip the PDFs that throw the error so that manual intervention is not required?

jjjake commented 5 months ago

There is not currently a way to skip failed uploads and continue uploading other files specified in the command (I support this feature though, if anybody has time to add it).

I would suggest finding and filtering any invalid PDFs before uploading:

» find my_pdf_dir -type f | parallel 'pdfinfo -- {} >/dev/null 2>&1 || echo invalid pdf: {}'

wcedmisten commented 4 months ago

Thanks! For posterity, that command didn't output anything, even though pdfinfo on an individual bad file was printing the error correctly. I ended up writing a non-parallelized version:

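# Loop over a glob instead of $(ls) so filenames with spaces survive
for f in *; do
  # 2>&1 >/dev/null pipes only stderr to grep and discards stdout
  if pdfinfo "$f" 2>&1 >/dev/null | grep 'Syntax'; then
    echo "Error on $f"
  fi
done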
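
Putting the two suggestions in this thread together, here is a rough sketch of a loop that checks each PDF and then uploads the files one at a time, so a rejected file gets logged instead of stopping the run. It is not a feature of `ia`. The helper names `pdf_is_valid` and `upload_valid_pdfs`, the `PDFINFO` and `IA` override variables, and the `skipped.txt` log name are all made up for illustration. It also assumes two things that the thread only hints at: that `pdfinfo` may exit 0 while still printing a `Syntax` message on stderr (which would explain why the `||` version printed nothing), and that `ia upload` exits non-zero when a file is rejected.

```shell
# Hypothetical helpers; PDFINFO and IA allow the commands to be swapped out.

# Succeeds only if pdfinfo exits 0 AND prints no "Syntax" message on stderr.
pdf_is_valid() {
  local errors
  # 2>&1 >/dev/null captures stderr only; stdout is discarded.
  errors=$("${PDFINFO:-pdfinfo}" "$1" 2>&1 >/dev/null) || return 1
  case $errors in
    *Syntax*) return 1 ;;
  esac
  return 0
}

# Usage: upload_valid_pdfs DIR IDENTIFIER [LOG]
# Uploads each *.pdf directly inside DIR with its own `ia upload` call.
# Files that fail the check or the upload are appended to LOG and skipped.
upload_valid_pdfs() {
  local dir identifier log f
  dir="$1"
  identifier="$2"
  log="${3:-skipped.txt}"
  for f in "$dir"/*.pdf; do
    [ -f "$f" ] || continue   # also covers the "no matches" case
    if ! pdf_is_valid "$f"; then
      printf 'invalid pdf: %s\n' "$f" >> "$log"
    elif ! "${IA:-ia}" upload "$identifier" "$f"; then
      printf 'upload failed: %s\n' "$f" >> "$log"
    fi
  done
}
```

After defining the functions, something like `upload_valid_pdfs my_pdf_dir my-item-identifier` (with a placeholder identifier) would work through the directory and leave the problem files listed in `skipped.txt`. One call per file is slower than a single bulk `ia upload`, but it keeps one bad PDF from blocking the remaining files.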