github-linguist / linguist

Language Savant. If your repository's language is being reported incorrectly, send us a pull request!
MIT License

Incorrect mime-type reported for Adobe Illustrator files #4572

Closed Alhadis closed 5 years ago

Alhadis commented 5 years ago

Linguist reports a content-type of application/postscript for *.ai files, instead of the correct application/pdf:

λ bundle exec rake bin/github-linguist test/fixtures/Binary/octocat.ai
octocat.ai: 0 lines (0 sloc)
  type:      Binary
  mime type: application/postscript
  Language:

I can tell one of the mime_* gems Linguist's using is to blame. But the reason I'm reporting this here (as well as why I care how Linguist classifies a binary file) is that I'm wondering if fixing the bogus mime-type would enable AI files to be rendered as PDFs on GitHub.

I recall reporting this to site support at least twice now, but I'm wondering if it could be handled on our end after stumbling across this:

https://github.com/github/linguist/blob/001ca526694a32a5ac4c977a79032ced0b8c0aca/test/test_blob.rb#L31-L32

Adobe Illustrator files are simply PDFs with extra (editor-specific) metadata attached. There's otherwise nothing that sets it apart genetically from a run-of-the-mill PDF file:

λ file --mime test/fixtures/Binary/octocat.ai
octocat.ai: application/pdf; charset=binary

Here's proof in a branch I just pushed. All I did was change the .ai extension to .pdf:

Figure 1

Hey presto, suddenly Illustrator artwork can be viewed on GitHub.com! (BTW, PostScript isn't binary. :) It's 7-bit clean US-ASCII, about as plain as plain-text can possibly get. 😉)

Template removed as it doesn't really apply.

lildude commented 5 years ago

Adobe Illustrator files are simply PDFs with extra (editor-specific) metadata attached. There's otherwise nothing that sets it apart genetically from a run-of-the-mill PDF file:

Not quite according to the Wikipedia description for Adobe Illustrator Artwork files as it says the content can be PDF or EPS.

This sentence in particular:

Early versions of the AI file format are true EPS files with a restricted, compact syntax, with additional semantics represented by Illustrator-specific DSC comments that conform to DSC's Open Structuring Conventions.

... probably explains the current classification as application/postscript.

A quick search shows there are a lot of instances of both EPS/PostScript and PDF .ai files with the PDF files being very much in the minority, though that's probably because they may be detected as binary.

Having the gem differentiate between the two may cover both scenarios and get you what you want for the PDF .ai files without affecting the EPS/PostScript files.

Alhadis commented 5 years ago

Any idea which gem to submit a pull-request to? There are a few MIME gems in use.

Not quite according to the Wikipedia description for Adobe Illustrator Artwork files as it says the content can be PDF or EPS.

Ah right, I forgot about legacy formats. Yes, Illustrator versions 3 through to 8 saved artwork as specialised EPS files, with later versions being PDF-based (here're the save options to give you a sense of scale):

Figure 1

The headers of the EPS-based AI files are conspicuous, so disambiguating should be easy:

Illustrator 3

%!PS-Adobe-3.0 
%%Creator: Adobe Illustrator(TM) 3.2
%%AI8_CreatorVersion: 23.0.4
%%For: (John) ()
%%Title: (Untitled-1.ai)
%%CreationDate: 7/7/19 7:16 pm
%%Canvassize: 16383
%%BoundingBox: 71 -619 570 -119
%%DocumentProcessColors: Cyan Magenta Yellow Black

Illustrator 8

%!PS-Adobe-3.0 
%%Creator: Adobe Illustrator(R) 8.0
%%AI8_CreatorVersion: 23.0.4
%%For: (John) ()
%%Title: (Untitled-2.ai)
%%CreationDate: 7/7/19 7:17 pm
%%Canvassize: 16383
%%BoundingBox: 71 -619 570 -119
%%HiResBoundingBox: 71.0675 -618.185 569.5992 -119.6533
%%DocumentProcessColors: Cyan Magenta Yellow Black

Illustrator 9 — present day

%PDF-1.4
%‚„œ”
1 0 obj
<</Metadata 2 0 R/Pages 3 0 R/Type/Catalog>>
endobj
2 0 obj
<</Length 3102/Subtype/XML/Type/Metadata>>stream
<?xpacket begin="Ôªø" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.6-c145 79.163499, 2018/08/13-16:40:22        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:xmp="http://ns.adobe.com/xap/1.0/"
            xmlns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"
            xmlns:dc="http://purl.org/dc/elements/1.1/"
            xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
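
To make the disambiguation idea concrete, here is a minimal Ruby sketch that looks only at the leading bytes shown in the headers above. The helper name `sniff_ai_mime_type` and its return values are hypothetical; nothing here is taken from Linguist or from any of the MIME gems, and a real fix might look quite different.

```ruby
# Hypothetical helper (not part of Linguist or any MIME gem):
# guess the MIME type of an .ai file from its first bytes.
def sniff_ai_mime_type(data)
  # Work on a binary copy so that non-UTF-8 bytes cannot raise encoding errors.
  head = data.to_s.b[0, 1024]

  if head.start_with?("%PDF-")
    # Illustrator 9 and later: PDF-based artwork.
    "application/pdf"
  elsif head.start_with?("%!PS-Adobe-")
    # Illustrator 3 through 8: EPS-based artwork.
    "application/postscript"
  else
    # Unknown header: leave the decision to the caller.
    nil
  end
end

eps_header = "%!PS-Adobe-3.0 \n%%Creator: Adobe Illustrator(R) 8.0\n"
pdf_header = "%PDF-1.4\n1 0 obj\n"

puts sniff_ai_mime_type(eps_header)
puts sniff_ai_mime_type(pdf_header)
```

Returning `nil` for an unrecognised header is a deliberate choice in this sketch, so that a caller could fall back to whatever the extension-based lookup already reports.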

though that's probably because they may be detected as binary.

It's possible for a PDF to be uncompressed ASCII, but that logically inflates the filesize, so it's rarely done except for pedagogy or debugging.
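
As a rough illustration of the "7-bit clean" distinction, the Ruby sketch below checks whether every byte in a sample is below 0x80. The helper name `seven_bit_clean?` is hypothetical and this is not how Linguist decides whether a blob is binary. The high-bit bytes in the PDF sample are an assumption, standing in for the comment line that PDF writers conventionally place after the `%PDF-` header to mark the file as binary.

```ruby
# Hypothetical helper: true when the data contains only 7-bit (US-ASCII) bytes.
def seven_bit_clean?(data)
  data.each_byte.all? { |byte| byte < 0x80 }
end

# EPS-style header, as in the Illustrator 8 example.
ps_sample = "%!PS-Adobe-3.0 \n%%Creator: Adobe Illustrator(R) 8.0\n"

# PDF-style header; the four high-bit bytes are illustrative, not read from a real file.
pdf_sample = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n1 0 obj\n".b

puts seven_bit_clean?(ps_sample)
puts seven_bit_clean?(pdf_sample)
```

A header-only check like this says nothing about the rest of the file: an uncompressed PDF could still be mostly ASCII after that comment line.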

lildude commented 5 years ago

Any idea which gem to submit a pull-request to?

https://github.com/mime-types/mime-types-data, I think, as this is where MiniMime draws from.

Alhadis commented 5 years ago

Thanks! 👍 I've opened an issue there.

Apparently Illustrator 9 was released in 2000, making EPS-based AI formats over 20 years old... 😕 So it's probably worth doing away with the header-sniffing altogether and just assuming PDF is used.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.