hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/
Other
2.39k stars 158 forks

Support more filetypes / arbitrary file import #261

Open glael opened 4 years ago

glael commented 4 years ago

Some screenshots from the discord

Since PSD files are already allowed, there is probably no reason not to allow these: image image image

Similarly, .wav files are not very different from mp3 files: image

You get the idea: image

The actual issue

Since this gets requested a lot, it is probably worth discussing arbitrary file imports (i.e. allowing any file to be imported, without checking if it's "allowed" first): image

This is potentially dangerous though, since the act of opening a file sometimes changes the hash of the file. For example, when opening a .epub file with calibre, it adds/changes the META-INF/calibre-bookmarks.txt file inside the epub/zip. This would break hydrus' hash-based system.

So: please discuss: "should hydrus prevent users from importing such files?"
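To make the hash concern concrete, here is a minimal sketch of what happens when a program adds a member to a ZIP-based file such as an epub. It assumes SHA-256 as the content hash; the member names and contents are made up for illustration and this is not hydrus code.

```python
import hashlib
import io
import zipfile


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def make_epub_like(extra_member: bool) -> bytes:
    """Build a tiny ZIP in memory that stands in for an epub (illustrative only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("content.opf", "<package/>")
        if extra_member:
            # Simulates a reader application writing its own bookmark file
            zf.writestr("META-INF/calibre-bookmarks.txt", "bookmark data")
    return buf.getvalue()


original = make_epub_like(extra_member=False)
modified = make_epub_like(extra_member=True)

hash_before = sha256_hex(original)
hash_after = sha256_hex(modified)
```

The two digests differ even though the book content is untouched, so a store that names files by their hash would no longer recognise the modified file.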

Zweibach commented 4 years ago

In my opinion Hydrus should allow all "supported" formats through with no problem, but on at least the first import of a format that is likely to be changed by being used, it should give an explicit warning about how things can go pear-shaped. If things do go pear-shaped afterwards, then it's explicitly on the user's head. There should probably also be a warning for all formats it doesn't recognise or know how to group (see system:filetype and how it groups image formats together).

CuddleBear92 commented 4 years ago

Noting some 3D formats as people have requested it in the past:

.fbx .obj .daz .stl .3ds .mdl .sdl

There are many more formats for 3D use... a 3D viewer is probably needed if one or more of these are added.

ShadowJonathan commented 4 years ago

Maybe the MIME type could be inferred from file content, import details, etc., with users getting a warning/amend dialog if a MIME type cannot be determined?

If the whole range (or at least a large subset) of standard MIME can be recognized and imported right off the bat, it'd instantly support a wide range, and later on make it easy to add more types.

imtbl commented 4 years ago

Maybe the MIME type could be inferred from file content, import details, etc., with users getting a warning/amend dialog if a MIME type cannot be determined?

If the whole range (or at least a large subset) of standard MIME can be recognized and imported right off the bat, it'd instantly support a wide range, and later on make it easy to add more types.

That's already the case, Hydrus uses magic numbers to determine file types. The limited support is more of an intended limitation atm and not really a technical one afaik.
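For readers unfamiliar with magic numbers, the idea can be sketched as below. This is only an illustration under my own assumptions: the signature table is a small hand-picked subset and `sniff_mime` is a hypothetical helper, not the function hydrus actually uses.

```python
from typing import Optional

# (offset, signature bytes, MIME type); a small illustrative subset
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "image/png"),
    (0, b"\xff\xd8\xff", "image/jpeg"),
    (0, b"GIF87a", "image/gif"),
    (0, b"GIF89a", "image/gif"),
    (0, b"8BPS", "image/vnd.adobe.photoshop"),
    (0, b"%PDF-", "application/pdf"),
    (0, b"PK\x03\x04", "application/zip"),
]


def sniff_mime(header: bytes) -> Optional[str]:
    """Return a MIME type if the leading bytes match a known signature, else None."""
    for offset, magic, mime in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return mime
    return None
```

In practice only the first few hundred bytes of a file need to be read, e.g. `sniff_mime(open(path, "rb").read(256))`. Container formats (epub, cbz, Krita files are all ZIPs) need a deeper look than the first bytes, which is part of why adding a type is more than adding a table row.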

walramb commented 4 years ago

This is potentially dangerous though, since the act of opening a file sometimes changes the hash of the file. For example, when opening a .epub file with calibre, it adds/changes the META-INF/calibre-bookmarks.txt file inside the epub/zip. This would break hydrus' hash-based system.

I'm not familiar with that software in particular, but couldn't a lot of mangling of this sort be prevented by having hydrus set all its media file permissions to read-only?
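A rough sketch of the read-only idea, using only the standard library. `make_read_only` is a hypothetical helper name, and whether hydrus could apply this safely to its own file store is not something this thread establishes.

```python
import os
import stat
import tempfile


def make_read_only(path: str) -> None:
    """Clear the write bits for owner, group and others on a file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)


# Usage on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bin")
    with open(sample, "wb") as f:
        f.write(b"data")
    make_read_only(sample)
    owner_can_write = bool(os.stat(sample).st_mode & stat.S_IWUSR)
    # Restore write permission so the temporary directory can be cleaned up on Windows
    os.chmod(sample, stat.S_IRUSR | stat.S_IWUSR)
```

This only stops well-behaved programs: the owning user (or an administrator) can still change the permissions back, and on Windows `os.chmod` can only toggle the read-only flag.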

bbappserver commented 4 years ago

My understanding of file identification is that it currently relies on some mix of magic number filetype detection and ffmpeg. https://pypi.org/project/filetype/

For weird filetypes it is usually the responsibility of HTTP to provide a Content-Type.

I think it is fine for files that could not be identified by either of these means to just become application/octet-stream, and you can manually adjust on your end. MIME types basically exist because extensions are unreliable, but you could let the user whitelist the problematic extensions and assign them a MIME type, so they don't have to go in and manually reassign things.
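The fallback order described above could look roughly like this. Everything here is a hypothetical sketch: `resolve_mime`, its parameters and the extension map are invented names, and the precedence (sniffed type, then HTTP Content-Type, then user mapping, then octet-stream) is one reasonable reading of the comment rather than an agreed design.

```python
import os
from typing import Dict, Optional

FALLBACK_MIME = "application/octet-stream"


def resolve_mime(
    path: str,
    sniffed: Optional[str],
    http_content_type: Optional[str],
    user_extension_map: Dict[str, str],
) -> str:
    """Pick a MIME type from the most to the least reliable source."""
    if sniffed:
        # Result of magic-number / ffmpeg detection, if any
        return sniffed
    if http_content_type:
        # Drop parameters such as "; charset=utf-8"
        return http_content_type.split(";", 1)[0].strip().lower()
    # User-whitelisted extensions, e.g. {".stl": "model/stl"}
    extension = os.path.splitext(path)[1].lower()
    return user_extension_map.get(extension, FALLBACK_MIME)
```

For a local import there is no Content-Type, so an unrecognised `.stl` file would get whatever the user mapped for that extension, and anything unmapped would land in `application/octet-stream`.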

pozieuto commented 4 years ago

One of those Discord screenshots implies that filetypes without visual media viewer support can't have thumbnails. I don't see why that would need to be the case. Some applications generate their own thumbnails for their own filetypes. For example, LibreOffice has a Windows Explorer extension that generates thumbnails for documents - could Hydrus just grab those thumbnails for those files?

Audio files could have thumbnails generated based on their waveform. ffmpeg could be used for this.
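As a sketch of the waveform idea, ffmpeg's `showwavespic` filter renders a whole audio file into a single image. The helper names and the default size below are my own assumptions, and the command needs an `ffmpeg` binary on the PATH.

```python
import subprocess
from typing import List


def waveform_thumbnail_command(
    audio_path: str, png_path: str, width: int = 200, height: int = 200
) -> List[str]:
    """Build an ffmpeg command that draws the audio waveform into one PNG."""
    return [
        "ffmpeg",
        "-y",  # overwrite the output file if it exists
        "-i", audio_path,
        "-filter_complex", f"showwavespic=s={width}x{height}",
        "-frames:v", "1",
        png_path,
    ]


def render_waveform_thumbnail(audio_path: str, png_path: str) -> None:
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(waveform_thumbnail_command(audio_path, png_path), check=True)
```

Calling `render_waveform_thumbnail("song.wav", "song_thumb.png")` would then produce the thumbnail, with the file names here being placeholders.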

floogulinc commented 4 years ago

Hydrus does currently generate thumbnails for a file type it can't view, namely swf.

The issue of hashes changing is interesting. What is the current behavior if I, say, edit an image from Hydrus in Paint or Photoshop and save it? Does Hydrus recompute the hash or just keep thinking the file has its original hash?

imtbl commented 4 years ago

Does Hydrus recompute the hash or just keep thinking the file has its original hash?

The latter at first, but once file maintenance runs and Hydrus checks the hashes, it will think the file is invalid, see screenshot.

image

That means editing files that have been added to Hydrus is currently not really a thing. Keeping track of potentially millions of files changing is also not that trivial afaik. While Hydrus is running, you could watch for filesystem events. But if changes are made to a file while Hydrus is closed, Hydrus would potentially need to recheck every file on startup (which is not feasible I think).

Of course, you could do things like calculating hashes for groups of files and then checking those first (and if a hash is wrong, then check each file of the group, basically a chunked approach). I'm not really an expert on this. The way Git does it might also be feasible, but I don't know how well that would perform with this many (and potentially huge) binary files.
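A minimal sketch of the chunked approach, with one deliberate change that is my assumption and not part of the proposal: the per-group fingerprint is built from cheap metadata (name, size, modification time) rather than file content, because hashing content for the group check would still mean reading every file. Group names such as "f00" are placeholders.

```python
import hashlib
from typing import Dict, List, Tuple

# (file name, size in bytes, modification time in nanoseconds)
FileStat = Tuple[str, int, int]


def group_fingerprint(stats: List[FileStat]) -> str:
    """Hash the metadata of all files in a group into one digest."""
    h = hashlib.sha256()
    for name, size, mtime_ns in sorted(stats):
        h.update(f"{name}\x00{size}\x00{mtime_ns}\n".encode("utf-8"))
    return h.hexdigest()


def groups_to_recheck(
    stored: Dict[str, str], current: Dict[str, List[FileStat]]
) -> List[str]:
    """Return the groups whose fingerprint no longer matches the stored one."""
    return sorted(
        group
        for group, stats in current.items()
        if stored.get(group) != group_fingerprint(stats)
    )
```

Only files inside the returned groups would then need a full content hash. The obvious weakness is that an edit which preserves both size and modification time goes unnoticed, so this narrows the startup check rather than replacing periodic full verification.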

I proposed a simpler workaround in Discord that could also work, but surely isn't as ideal or as desirable:

image

bb010g commented 4 years ago

If the backing filesystem supports them, snapshots could be used, but that's a weird complexity. Implementation-wise, Windows's Volume Shadow Copy APIs can be not so fun too. (Directing VSS to only copy a specific directory isn't obvious. ZFS or btrfs should be far easier, though.) https://docs.microsoft.com/en-us/windows/win32/vss/using-the-volume-shadow-copy-service

Zweibach commented 4 years ago

Open Office XML document support as per #362 , see this comment in that issue for more information.

bbappserver commented 4 years ago

One of those Discord screenshots implies that filetypes without visual media viewer support can't have thumbnails. I don't see why that would need to be the case. Some applications generate their own thumbnails for their own filetypes. For example, LibreOffice has a Windows Explorer extension that generates thumbnails for documents - could Hydrus just grab those thumbnails for those files?

Audio files could have thumbnails generated based on their waveform. ffmpeg could be used for this.

That's not actually how thumbnailing works in operating systems. A program supplies a small subprogram to the OS for extracting a preview from a document, and either this subprogram is sufficient to perform playback (for example of certain video codecs) and/or produces a bitmap for the poster frame. This requires the installation of OS specific components, and is not portable across OSes, and is not known to other programs. At best there might be a way to extract some non preview, default icons for some files under some known common operating systems.

bbappserver commented 4 years ago

Also if I'm not mistaken a swf does not possess a concept of a poster frame and can have an arbitrary playback order, so generating a preview image instead of an icon is basically impossible, which is why hydrus just uses an icon embedded in hydrus.

floogulinc commented 4 years ago

Also if I'm not mistaken a swf does not possess a concept of a poster frame and can have an arbitrary playback order, so generating a preview image instead of an icon is basically impossible, which is why hydrus just uses an icon embedded in hydrus.

SWFs do have thumbnails in Hydrus.

bb010g commented 4 years ago

@bbappserver The fact that OSs handle thumbnailing differently doesn't mean you can't interact with them. See https://thumbsviewer.github.io/ & https://github.com/mdegrazia/OSX-QuickLook-Parser. Also, at least on Windows, https://github.com/QL-Win/QuickLook should be usable just fine as a library with a bit of patching.

Hydrus doesn't have to support a huge number of extremely different operating systems.

imtbl commented 4 years ago

I would prefer not to rely on OS-specific implementations for thumbnails, especially not if it means having to use a different library for each of them; imo, that's not worth it (both in terms of initial implementation and having to maintain it in the future) just to gain the ability to have content-based thumbnails for file types most users will likely never put into Hydrus. It would potentially also complicate things like the ability for the user to force re-generate thumbnails.

DonaldTsang commented 4 years ago

Supporting different file types UI wise (browsing) should be trivial, however launching them and integrating features into the application is not. Not everything is supported by MPV or browsers. @ShadowJonathan https://github.com/openpreserve/fido and https://github.com/richardlehane/siegfried are the gold standards for file identification in the archive community, they seem to do everything, from magic numbers to "deeper checks". Others include:

Suika commented 4 years ago

Easier would be to add a check box, that would allow the import of binary data of any type. The whole processing, thumbnailing and what not can come at the later stages. Basic enforcement of any type of data is probably more important.

PillowLounge commented 4 years ago

I have a collection of SVG icons that would feel at home together with their png brethren in hydrus. I'd assume support would be less work than any 3d, interactive or animated media.

floogulinc commented 3 years ago

Clip Studio Paint (.clip) files have also been requested. #810

souxd commented 1 year ago

I'm not familiar with that software in particular, but couldn't a lot of mangling of this sort be prevented by having hydrus set all its media file permissions to read-only?

Yes that sounds good

I proposed a simpler workaround in Discord that could also work, but surely isn't as ideal or as desirable

If you need to export the file wouldn't it be easier and safer to make a copy with the changes and remove the older asset/project?

private02E4 commented 10 months ago

With the recent addition of CBZ support, I'd like to throw in the suggestion of additional compressed filetypes. I would love to spin up a Hydrus database that could archive my Photoshop/Clip/Krita projects, but their file sizes can be pretty massive. I store my finished projects in ZIP, which often cuts the size in half (e.g. 1gb -> 500mb). If we could support something like this, it would open up Hydrus for completely new use-cases.

With the new "Force filetype" option, I can actually force the zip as a PSD, regenerate thumbnail (it will default to the PSD filetype icon), and then overwrite that manually with a thumbnail I made myself. Hydrus won't use the thumbnail file if it's just a ZIP though, and when "forced" as a PSD, "open in external program" will ask Photoshop to open the ZIP file.

I'm sure this is easier said than done. I'm not familiar with how the image project files are handled, but streaming them through a pkzip to an image library may pose a challenge.

A simpler approach that I think may be a good solution to supporting many different file formats:

Just my $0.02

floogulinc commented 10 months ago

I would love to spin up a Hydrus database that could archive my Photoshop/Clip/Krita projects, but their file sizes can be pretty massive. I store my finished projects in ZIP, which often cuts the size in half (e.g. 1gb -> 500mb).

Krita files are already ZIPs themselves but are uncompressed by default. You can just change a setting in Krita to use compression, though, instead of putting the file in another zip. Photoshop also has an option for compressing the layers in PSD files, which can use either RLE or ZIP compression.
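Since these project files are ordinary ZIP containers, whether a given file is stored or compressed can be checked with the standard library. This is a small sketch; `compression_summary` is a hypothetical helper and the archive built in the usage part is a stand-in, not a real Krita file.

```python
import io
import zipfile
from typing import Dict


def compression_summary(source) -> Dict[str, int]:
    """Count how many members of a ZIP are stored raw versus compressed."""
    summary = {"stored": 0, "compressed": 0}
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            if info.compress_type == zipfile.ZIP_STORED:
                summary["stored"] += 1
            else:
                summary["compressed"] += 1
    return summary


# Usage with an in-memory archive (member names are illustrative)
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("mimetype", "application/x-krita", compress_type=zipfile.ZIP_STORED)
    zf.writestr("layer.bin", b"\x00" * 1000, compress_type=zipfile.ZIP_DEFLATED)
demo.seek(0)
demo_summary = compression_summary(demo)
```

`source` can be a path or a file object; a file where nearly every member is "stored" is one that would shrink if compression were enabled in the application itself.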

Cguy7777 commented 9 months ago

Could JPEG XL support be added using pillow-jxl-plugin?

floogulinc commented 9 months ago

@Cguy7777 I have tried using that for JPEG XL, but it was too unstable, and when it crashed it caused the entire client to hang instead of throwing an exception.