sindresorhus / file-type

Detect the file type of a file, stream, or data
MIT License

PDFs created with Adobe Illustrator are wrongly detected as .ai files #360

Closed fungiboletus closed 3 years ago

fungiboletus commented 4 years ago

Since #323 (src: Add support for AI files (Adobe Illustrator)), file-type looks for the text "Adobe Illustrator" in PDF documents and, if it finds it, assumes the document is an Adobe .ai file.

It seems that normal PDFs created with Adobe Illustrator also contain the text "Adobe Illustrator" quite a few times in their metadata, even though they are not Adobe Illustrator files.
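The false positive described here can be reproduced with a minimal sketch. This is not file-type's actual code: `asciiBytes`, `indexOfBytes`, and `naiveDetect` are hypothetical helpers, and the sample bytes are made up for illustration. It only assumes the behaviour described above, namely that a PDF containing the text "Adobe Illustrator" is reported as .ai.

```typescript
// Hypothetical helper (not part of file-type): encode an ASCII string as bytes.
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

// Hypothetical helper: index of the first occurrence of `needle`, or -1.
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array): number {
  outer: for (let i = 0; i + needle.length <= haystack.length; i++) {
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) continue outer;
    }
    return i;
  }
  return -1;
}

// Assumed heuristic: any PDF that mentions "Adobe Illustrator" is called .ai.
function naiveDetect(data: Uint8Array): 'ai' | 'pdf' | undefined {
  if (indexOfBytes(data, asciiBytes('%PDF')) !== 0) return undefined;
  return indexOfBytes(data, asciiBytes('Adobe Illustrator')) !== -1 ? 'ai' : 'pdf';
}

// Made-up bytes standing in for a PDF exported from Illustrator:
// the creator string lives in ordinary PDF metadata.
const exportedPdf = asciiBytes('%PDF-1.5\n<</Creator (Adobe Illustrator 24.1)>>\n%%EOF');
console.log(naiveDetect(exportedPdf)); // 'ai', although the input is a plain PDF
```

Because the creator string appears in ordinary metadata, a substring match alone cannot tell an exported PDF from a native .ai file.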

sindresorhus commented 4 years ago

// @vladfrangu

vladfrangu commented 4 years ago

Can you send me an AI PDF file please? You can attach it here, email it or send it on Discord if it's easier for you (Vladdy#0002)

I'll look into it as soon as I can! 😅

fungiboletus commented 4 years ago

I installed Adobe Illustrator's trial and made a few test documents:

issue_360_filetype.zip

vladfrangu commented 4 years ago

Well it's good to know out of 6 cases, only one fails 😅 I'll look into it asap and let you know!

fungiboletus commented 4 years ago

True, but it's the one with the default settings for PDF in Adobe Illustrator.

vladfrangu commented 4 years ago

Hey! Sorry to keep you in the dark for 8 whole days. I just took a quick look at the text using a text diff viewer. Running a diff between the fixture.ai file in this repository's fixtures folder and the adobe-illustrator.pdf file from your archive yielded... a big middle finger from the metadata! However, one thing I spotted in the PDF is this section of data:

Data slice

```
20 0 obj <> endobj
21 0 obj <>stream
%!PS-Adobe-3.0
%%Creator: Adobe Illustrator(R) 24.0
%%AI8_CreatorVersion: 24.1.2
%%For: (Antoine Pultier) ()
%%Title: (test-no-pdf.ai)
%%CreationDate: 5/13/2020 1:22 PM
%%Canvassize: 16383
%%BoundingBox: 145 -255 407 -166
%%HiResBoundingBox: 145.537109375 -254.33642578125 406.0888671875 -166.9560546875
%%DocumentProcessColors: Black
%AI5_FileFormat 14.0
%AI12_BuildNumber: 408
%AI3_ColorUsage: Color
%AI7_ImageSettings: 0
%%CMYKProcessColor: 1 1 1 1 ([Registration])
%AI3_Cropmarks: 0 -841.8897637795 595.2755905512 0
%AI3_TemplateBox: 298.5 -421.5 298.5 -421.5
%AI3_TileBox: 27.6377952756002 -780.944881889751 567.637795275601 -60.9448818897499
%AI3_DocumentPreview: None
%AI5_ArtSize: 14400 14400
%AI5_RulerUnits: 1
%AI9_ColorModel: 2
%AI5_ArtFlags: 0 0 0 1 0 0 1 0 0
%AI5_TargetResolution: 800
%AI5_NumLayers: 1
%AI9_OpenToView: -381 23 1.13 1542 988 18 0 0 120 87 0 0 0 1 1 0 1 1 0 0
%AI5_OpenViewLayers: 7
%%PageOrigin:-8 -817
%AI7_GridSettings: 72 8 72 8 1 0 0.800000011920929 0.800000011920929 0.800000011920929 0.899999976158142 0.899999976158142 0.899999976158142
%AI9_Flatten: 1
%AI12_CMSettings: 00.MS
%%EndComments
endstream endobj
```

Technically, this can be used to detect if this is, in the end, a PDF file. However, I don't know how many cans of worms this will also open up, as I'm not an active user of Adobe products. I can, however, attempt to implement a PR for this!
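As a rough illustration of how the header comments in that slice could be read, here is a minimal sketch. `parseHeaderComments` is a hypothetical helper, not part of file-type, and it assumes the stream text has already been extracted with one comment per line. How the resulting fields should feed into a PDF versus .ai decision is left open, as it is in this thread.

```typescript
// Hypothetical helper: collect `%%Key: value` and `%Key value` header comments
// into a map, stopping at the `%%EndComments` marker.
function parseHeaderComments(streamText: string): Map<string, string> {
  const comments = new Map<string, string>();
  for (const rawLine of streamText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '%%EndComments') break;
    // One or two '%', a key, then ':' or whitespace, then the value.
    const match = /^%{1,2}([A-Za-z0-9_]+)[:\s]\s*(.*)$/.exec(line);
    if (match) comments.set(match[1], match[2]);
  }
  return comments;
}

// A few lines taken from the data slice, plus one made-up trailing line
// to show that parsing stops at %%EndComments.
const sample = [
  '%!PS-Adobe-3.0',
  '%%Creator: Adobe Illustrator(R) 24.0',
  '%%AI8_CreatorVersion: 24.1.2',
  '%AI5_FileFormat 14.0',
  '%%PageOrigin:-8 -817',
  '%%EndComments',
  '%%Ignored: after the header',
].join('\n');

const header = parseHeaderComments(sample);
console.log(header.get('Creator'));        // Adobe Illustrator(R) 24.0
console.log(header.get('AI5_FileFormat')); // 14.0
```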

CSoellinger-IDS commented 4 years ago

Any chance of getting this fixed? The console version of "file-type" gets the correct file type (it's using npm file-type v12.xx, I think). I'm using file-type as an upload validator, so my only two options are to allow AI file types too or to wait for a bugfix :)

Rechnung_400006880095.pdf.zip

vladfrangu commented 4 years ago

It's fixable but I have to mess around with it a lot cause of the way PDF files exist... Basically:

CSoellinger-IDS commented 4 years ago

Ok, so for now I "just" also accept AI files and hope it will be fixed at some point :) I am not familiar with the code of the file-type package, but maybe I'll get some time to look into this problem too... based on your three steps :)

Either way, it would be cool to get this fixed :)
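The stopgap described here (accepting .ai results alongside .pdf in an upload validator) might look roughly like the sketch below. `DetectedType`, `PDF_LIKE_EXTENSIONS`, and `isAcceptedPdfUpload` are hypothetical names; the only assumption about file-type is that a detection result carries an `ext` and a `mime` field, or is undefined when nothing is recognised.

```typescript
// Assumed shape of a detection result: an extension plus a MIME type.
interface DetectedType {
  ext: string;
  mime: string;
}

// Workaround: treat a result of "ai" as acceptable wherever "pdf" is expected.
const PDF_LIKE_EXTENSIONS = new Set<string>(['pdf', 'ai']);

function isAcceptedPdfUpload(detected: DetectedType | undefined): boolean {
  return detected !== undefined && PDF_LIKE_EXTENSIONS.has(detected.ext);
}

console.log(isAcceptedPdfUpload({ext: 'ai', mime: 'application/postscript'})); // true
console.log(isAcceptedPdfUpload({ext: 'png', mime: 'image/png'}));             // false
```

The trade-off is that genuine .ai files pass the check too, so this only makes sense where that is tolerable.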

thekiwi commented 4 years ago

We've also encountered this same regression. We rely on the PDF detection functionality to validate specific PDF processing requests but as of file-type@14.1.0 this process breaks as file-type returns the files as .ai (application/postscript).

Is there a solution in mind here? I'd suggest it's more of a 'bug' than an 'enhancement' as it is falsely identifying one file type as another.

For now, we've pinned to 14.0.0 until this is resolved. We'll also look at submitting a PR if we can determine a nice fix on our side.

cmcgrath13 commented 2 years ago

This appears to be back. I am currently using file-type@17.1.1, and PDFs exported from Illustrator with parameters similar to @fungiboletus's are improperly detected as .ai files. @vladfrangu

vladfrangu commented 2 years ago

Well the parsing was changed in #396 from what I did so I don't really know what the issue is. Best thing you can probably do is attach a file sample with the broken detection and someone will hopefully take a look

cmcgrath13 commented 2 years ago

> Well the parsing was changed in #396 from what I did so I don't really know what the issue is. Best thing you can probably do is attach a file sample with the broken detection and someone will hopefully take a look

@vladfrangu I can DM someone the file for testing, but would prefer to not share it in a public setting. Where should I send this?

vladfrangu commented 2 years ago

Could you replicate the PDF with non-sensitive information? (That also helps, since it can be added as a test fixture in the repo.)

cmcgrath13 commented 2 years ago

> Could you replicate the PDF with non-sensitive information? (That also helps, since it can be added as a test fixture in the repo.)

Sure, let me generate something