unoconv / unoserver

Possibility to get a list of supported input mime types or file extensions? #13

Closed mara004 closed 5 months ago

mara004 commented 2 years ago

I need some way to check whether a file is of an input type supported by UnoConverter.convert(), because my library has several different importers. It would therefore be useful to have a list of mime types (or file extensions) that unoserver accepts. It seems that this functionality was present in unoconv with the Fmt/FmtList classes. Would it be possible to add this back to unoserver?

regebro commented 2 years ago

Unoserver should accept what Libreoffice accepts. Unlike unoconv there is no hardcoded list in Unoserver, it just asks LibreOffice.

mara004 commented 2 years ago

Sure, but that doesn't change the fact that I need to know which formats are supported. However, I agree that a hardcoded list is not elegant. Does LibreOffice/pyuno provide any way to retrieve the supported extensions/mimes?

regebro commented 2 years ago

I don't remember, but I can take a look. Currently, unoconvert only looks up the supported filter for the given document type. There may be a way to list all of them.

regebro commented 5 months ago

I can't find any way to get a list of extensions or file types. You can get a list of input and output filters, but the naming is not consistent, and only some of them include information on extensions. But if you now run it with --input-filter or --output-filter and type some nonsense, it will return a list of the existing filters you can use.

It's not elegant, but it's the best we can do.

mara004 commented 5 months ago

Thanks for coming back to this (fun fact: it's been exactly 2 years). I'll take a look when I have time. Is there any official specification of these input filters? Or what was unoconv's FmtList table derived from?

regebro commented 5 months ago

I haven't found any official specification, and the lack of file ending information is especially annoying. Unoconv has manually maintained lists, which was one of the problems with it.

The filter lists will look something like this::

Available filters: ['CWW8', 'CXML', 'CXMLV', 'CXMLVWEB', 'Calc MS Excel 2007 VBA XML', 'Calc MS Excel 2007 XML', 'Calc MS Excel 2007 XML Template', 'Calc Office Open XML', 'DIF', 'DocBook File', 'EPUB', 'HTML', 'HTML (StarCalc)', 'HTML (StarWriter)', 'Impress MS PowerPoint 2007 XML', 'Impress MS PowerPoint 2007 XML AutoPlay', 'Impress MS PowerPoint 2007 XML Template', 'Impress MS PowerPoint 2007 XML VBA', 'Impress Office Open XML', 'Impress Office Open XML AutoPlay', 'Impress Office Open XML Template', 'MS Excel 2003 XML', 'MS Excel 97', 'MS Excel 97 Vorlage/Template', 'MS PowerPoint 97', 'MS PowerPoint 97 AutoPlay', 'MS PowerPoint 97 Vorlage', 'MS Word 2003 XML', 'MS Word 2007 XML', 'MS Word 2007 XML Template', 'MS Word 2007 XML VBA', 'MS Word 97', 'MS Word 97 Vorlage', 'MathML XML (Math)', 'MathType 3.x', 'OOXML', 'OXML', 'Office Open XML Text', 'Office Open XML Text Template', 'OpenDocument Drawing Flat XML', 'OpenDocument Presentation Flat XML', 'OpenDocument Spreadsheet Flat XML', 'OpenDocument Text Flat XML', 'RTF', 'Rich Text Format', 'SYLK', 'TEXT', 'TEXT_DLG', 'Text', 'Text (StarWriter/Web)', 'Text (encoded)', 'Text (encoded) (StarWriter/GlobalDocument)', 'Text (encoded) (StarWriter/Web)', 'Text - txt - csv (StarCalc)', 'UOF presentation', 'UOF spreadsheet', 'UOF text', 'XHTML Calc File', 'XHTML Draw File', 'XHTML Impress File', 'XHTML Writer File', 'XML', 'calc8', 'calc8_template', 'calc_jpg_Export', 'calc_pdf_Export', 'calc_png_Export', 'calc_svg_Export', 'calc_webp_Export', 'chart8', 'dBase', 'draw8', 'draw8_template', 'draw_bmp_Export', 'draw_emf_Export', 'draw_emz_Export', 'draw_eps_Export', 'draw_gif_Export', 'draw_html_Export', 'draw_jpg_Export', 'draw_pdf_Export', 'draw_png_Export', 'draw_svg_Export', 'draw_svgz_Export', 'draw_tif_Export', 'draw_webp_Export', 'draw_wmf_Export', 'draw_wmz_Export', 'impress8', 'impress8_draw', 'impress8_template', 'impress_bmp_Export', 'impress_emf_Export', 'impress_eps_Export', 'impress_gif_Export', 
'impress_html_Export', 'impress_jpg_Export', 'impress_pdf_Export', 'impress_png_Export', 'impress_svg_Export', 'impress_tif_Export', 'impress_webp_Export', 'impress_wmf_Export', 'macro-enabled', 'math8', 'math_pdf_Export', 'sd', 'writer8', 'writer8_template', 'writer_globaldocument_pdf_Export', 'writer_indexing_export', 'writer_jpg_Export', 'writer_layout_dump', 'writer_pdf_Export', 'writer_png_Export', 'writer_svg_Export', 'writer_web_StarOffice_XML_Writer', 'writer_web_jpg_Export', 'writer_web_pdf_Export', 'writer_web_png_Export', 'writer_web_webp_Export', 'writer_webp_Export', 'writerglobal8', 'writerglobal8_HTML', 'writerglobal8_template', 'writerglobal8_writer', 'writerweb8_writer', 'writerweb8_writer_template']

So it's rather ugly, but usable. And it's not always obvious what the differences are either, so often the best way is to just try everything that looks reasonable and see what works best. :-D
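Since the printed list is a plain Python literal, it can be read back in and sorted roughly by target format. The following is only a minimal sketch: the function names `parse_filter_list` and `group_filters` are made up for illustration, and the grouping relies on the `<module>_<format>_Export` naming pattern visible in the output above, which is a convention rather than a guarantee.

```python
import ast
from collections import defaultdict


def parse_filter_list(output: str) -> list:
    """Extract the filter names from text like "Available filters: ['a', 'b']"."""
    prefix = "Available filters:"
    start = output.index(prefix) + len(prefix)
    # The list is printed as a Python literal, so literal_eval reads it safely.
    return ast.literal_eval(output[start:].strip())


def group_filters(names):
    """Group names of the form '<module>_<format>_Export' by their format part.

    Names that do not follow that pattern are collected under the key None.
    """
    groups = defaultdict(list)
    suffix = "_Export"
    for name in names:
        if name.endswith(suffix):
            # "writer_web_pdf_Export" -> stem "writer_web_pdf" -> format "pdf"
            stem = name[: -len(suffix)]
            fmt = stem.rpartition("_")[2]
            groups[fmt].append(name)
        else:
            groups[None].append(name)
    return dict(groups)
```

Everything that does not match the pattern ('MS Word 97', 'writer8', 'writer_indexing_export' and so on) ends up under `None`, so this only helps with the export filters that happen to carry the format in their name.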

mara004 commented 5 months ago

I wonder if there would be any way to create/update file ending info programmatically? Like a script that iterates over all available filters, converts an input and checks the resulting file type?

But then this would miss import-only filters (I haven't checked if there are any – or in the worst case, filters might be completely divided).
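One way to check for import-only filters might be the per-filter flags, if they can be read for each filter name. The sketch below assumes that each filter carries an integer flags value in which bit 0x1 marks import and bit 0x2 marks export, as in LibreOffice's filter configuration as far as I understand it. Treat both constants, and the helper name `split_by_direction`, as assumptions to verify against the installed LibreOffice. The function works on a plain name-to-flags dict, so it does not show how to obtain the flags.

```python
# Assumed bit values for the import/export direction of a filter.
IMPORT_FLAG = 0x1
EXPORT_FLAG = 0x2


def split_by_direction(flags_by_name):
    """Sort filter names into import-only, export-only, both, or neither."""
    result = {"import_only": [], "export_only": [], "both": [], "neither": []}
    for name, flags in sorted(flags_by_name.items()):
        can_import = bool(flags & IMPORT_FLAG)
        can_export = bool(flags & EXPORT_FLAG)
        if can_import and can_export:
            result["both"].append(name)
        elif can_import:
            result["import_only"].append(name)
        elif can_export:
            result["export_only"].append(name)
        else:
            result["neither"].append(name)
    return result
```

Any names under `import_only` are the ones that a convert-and-inspect script would never see.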

regebro commented 5 months ago

No, because the extension is given as an input. If you don't specify a filter LibreOffice will guess a filter based on the extension, but I can't find any way of accessing that mapping through the uno API.

mara004 commented 5 months ago

> No, because the extension is given as an input.

I don't remember unoserver too well given the time that has passed. But I thought we could pass an export filter via filtername with outpath=None, and then detect the resulting file type via python-magic, assuming the document types all have embedded magic byte signatures in the file header? Then no export extension would be needed as input ... ?
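Roughly what I have in mind, as an untested sketch. The `convert` argument is a hypothetical callable that takes an input path and a filter name and returns the converted bytes; with unoserver it would presumably wrap UnoConverter.convert() with filtername set and outpath=None, but that wiring is an assumption here. Instead of python-magic, the sketch uses a tiny hand-written signature table so that it has no dependencies; the table is deliberately incomplete.

```python
from typing import Callable, Dict, Iterable, Optional

# A few well-known leading byte signatures. Zip-based formats (ODF, OOXML,
# EPUB) all share the same header and would need a look inside the archive.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"{\\rtf", "rtf"),
    (b"PK\x03\x04", "zip-container"),
]


def sniff(data: bytes) -> Optional[str]:
    """Return a label for the first matching signature, or None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def map_filters(
    filters: Iterable[str],
    convert: Callable[[str, str], bytes],
    sample_path: str,
) -> Dict[str, Optional[str]]:
    """Try every filter on one sample file and record the detected type.

    Filters that fail (for instance because they belong to another
    document type) or produce unrecognised output map to None.
    """
    mapping = {}
    for name in filters:
        try:
            data = convert(sample_path, name)
        except Exception:
            mapping[name] = None
            continue
        mapping[name] = sniff(data) if data else None
    return mapping
```

A single sample will only exercise the filters for its own document type, so this would have to be run once per type (text, spreadsheet, presentation, drawing) and the results merged.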

regebro commented 5 months ago

That might work, but it's easier to give the extension and let LibreOffice guess the filter. :-D

mara004 commented 5 months ago

Easier for a natural user, yes. However, if we can squeeze the supported filters out of libreoffice but not the extensions, then this seems like our only chance to (automatically) deduce the desired mapping, no?

regebro commented 5 months ago

Yeah, it might work.

mara004 commented 5 months ago

I hope I can give it a try at some point. Yet, the cleaner (but more difficult) way might be to dive into upstream and propose a patch to just expose the internal mapping. On the other hand, it might take ages until we could use such an API without breaking older installs.

Utopiah commented 4 months ago

Is it based on the Import column of https://wiki.openoffice.org/wiki/Framework/Article/Filter/FilterList_OOo_3_0 ?

regebro commented 4 months ago

No, it lists the available filters and prints out the names.
