Open lfcnassif opened 3 years ago
I think that using just the the glob patterns of mimetypes without signatures as fall back is a great idea. However, I think that a parameter of the glob patterns to restrict its use as a fallback would be better. In this case, if the user thinks that a given mimetype, like JPG, cannot be used if the signature does not match it put an extra attribute in the pattern definition. Finally, files without content could be treated in a different way, as they will not match any mimetypes because the content is not present. On these files maybe a good the maintain the actual behavior.
@gabrieldmf I would like your opinion on this, do you have some?
Hey @lfcnassif and @hauck-jvsh , I have a few thoughts on this.
Just some context. While working on #305, I was trying to figure out why the WinXTimelineParser was being called for some items, even though the SQLiteContainerDetector was returning an default sqlite as result, not an specific one. I then found, as @lfcnassif pointed, that it all comes down to MimeTypes detector class, which is the last one in execution order (DefaultDetector).
MimeTypes does 3 types of detection:
It uses the last two to "refine" the possible matches from signature detection. In case it does not find a signature match, it uses name/metadata detection as fallback, which explains the redirecting to the WinXTimelineParser.
I thought about a few approaches to handle that (as you guys also mentioned), and most came down to make some additional verification or minor changes in MimeTypes class.
After some look arround, I found it is fairly doable to add a "custom MimeTypes", based on the functionality and structure of the existing one. I think that might be a good starting point, as it will provide more flexibility in the detection flow.
I was already able to create a custom MimeTypes detector for testing (currently doing the same as the original). Tweaking it and adding some functionality should be fairly straight.
Let me know what you guys think, and I can work on that as we discuss.
Great, thank you @mbichara for the explanation I have not provided :). There are also some details about xml and text detection, but the general workflow is what you described.
As I said, my main concern is about mimetypes that could have signatures, but no signature is mandatory. The most relevant is RFC822 files. Tika had many issues about detecting it without extension. We also had some issue in the past with WMV detection, but that was a missing signature, extension fallback helped in some cases.
I think adding some Set
Any thoughts @tc-wleite ?
@lfcnassif, not sure if I am following the details of this discussion.... Is the goal of the proposed changes to avoid items with completely invalid content to be classified by extension and later to be sent to a parser which won't be able to parse them anyway, wasting processing time and possibly causing errors? If that is the case, I think it makes sense. In such case, let's say a file with .JPG extension but no JPG signature, which type would it receive? Unknown?
Parsing errors are not my original concern, although they will decrease a lot, what is very good (decreasing a little bit the chance of infinite loops and OOMEs). Your last question is the original question here. Should a overwritten file with .jpg extension, but no jpg signature, be classified as jpg image or as unknown file (if it doesn't have any known signature) or as Text file (if it has plain text in content)?
So yes, this is the proposal here.
PS: Some months ago I fixed an OOME with VCardParser because it received a huge .vcard file with no vcard signature. Actually VCardParser was vulnerable and could crash with a huge file beginning with VCARD signature too.
Got it now @lfcnassif, thanks! Well, it is kind of controversial... Some people (I would say mostly advanced users) are used to the current behavior, but for most users moving these "garbage" items out of their current type would make more sense. So I support the proposed change.
In very specific situations, these items could still be located easily (querying by the desired extension). So it would be more a matter of get used to this new way of assigning file types.
A question: it is possible that for some file types, signature (or other method?) type detection fails, but somehow the parser is still able to parse (at least partially) the item?
A question: it is possible that for some file types, signature (or other method?) type detection fails, but somehow the parser is still able to parse (at least partially) the item?
I think it depends on how much of the original file was overwritten and on the parser impl, but generally I think it has little chance of success.
If we change current behaviour, it will be changed just for v4.0, that is scheduled to middle 2021.
The very general rule used by Tika is to check if the file has some of all configured signatures. If some signature matches, use glob pattern definitions to refine the mimetype to some subtype, if defined. So if you have some file with JPG signature but PDF extension, it will be identified as JPG.
If no known signature matches, all glob pattern definitions are used as a fallback rule, even if there is some signature defined for the mimetype. So if you have some zeroed file with JPG extension, it will be identified as JPG, even though all JPGs must have a JPG signature.
I thought many times about changing this to: if no signature matches, use just the the glob patterns of mimetypes without signatures as fall back. Not sure if default Tika detector (MimeTypes class) could be extended or if I will need to submit the proposal to upstream project (and not sure if they will agree).
But this can cause problems eventually. Some very specific mimetypes have no fixed signatures (may have some or no one, like EML). There are also some mimes where not all possible signatures are defined (WMV in the past) and extension fallback helped.
Finally, if we have some overwritten zeroed jpg file with some relevant name, is it good to classify it as image so the user will know the name of a past jpg? Or it should be classified as unknown, which it really is?
PS: Current behaviour causes a lot of parse exceptions, with completely overwritten deleted files being identified by extension and sent to some parser that expects some known structure, although they are completely corrupted, so the parser fails.