unoconv / unoserver

MIT License

--convert-to "txt:Text (encoded):UTF8" #68

Closed varna9000 closed 1 year ago

varna9000 commented 1 year ago

Hi, I'm struggling to convert a docx file with unoserver. If I pass --convert-to "txt:Text (encoded):UTF8" to a plain libreoffice conversion, the file converts fine to plain text.

libreoffice --headless --convert-to "txt:Text (encoded):UTF8" 65726.docx
convert 65726.docx -> 65726.txt using filter : Text (encoded):UTF8
Overwriting: 65726.txt

However, if I use the same flag with unoconvert,

unoconvert --convert-to "txt:Text (encoded):UTF8" 65726.docx test.txt

the following error occurs:

.local/lib/python3.7/site-packages/unoserver/converter.py", line 180, in convert
    f"Unknown export file type, unknown extension '{extension}'"
RuntimeError: Unknown export file type, unknown extension 'txt:Text (encoded):UTF8'

If I try with a simple --convert-to "txt", the result is:

/python3.7/site-packages/unoserver/converter.py", line 186, in convert
    f"Could not find an export filter from {import_type} to {export_type}"
RuntimeError: Could not find an export filter from com.sun.star.text.TextDocument to writer_T602_Document

regebro commented 1 year ago

--convert-to in unoconvert is the desired file type. In your case "txt". You don't have to specify the exact filter, unoconvert finds it for you. In the case of there being multiple filters you can specify one with the --filter option, but that is usually not necessary.

varna9000 commented 1 year ago

The trouble is that I also tried without the option.

This command, unoconvert 65726.docx test.txt, gives the same error:

lib/python3.7/site-packages/unoserver/converter.py", line 186, in convert
    f"Could not find an export filter from {import_type} to {export_type}"
RuntimeError: Could not find an export filter from com.sun.star.text.TextDocument to writer_T602_Document

Just as a comparison, conversion to pdf works fine:

unoconvert 65726.docx test.pdf
INFO:unoserver:Starting unoconverter.
INFO:unoserver:Opening 65726.docx
INFO:unoserver:Exporting to test.pdf
INFO:unoserver:Using writer_pdf_Export export filter

regebro commented 1 year ago

Have you tried using the --filter argument?

In your configuration the "txt" file ending is attached to two different file types, and it chooses writer_T602_Document by default. You can probably change those extension attachments as well.
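
To illustrate why the lookup fails, here is a minimal Python sketch of an extension-based filter search. It is not unoserver's actual code: the two tables and the find_export_filter helper are hypothetical, and the table contents are assumptions that merely reuse the names seen in this thread. The point is that when the first file type registered for "txt" has no export filter, the search fails unless a filter is named explicitly.

```python
# Illustrative tables only: the names mirror those in this thread, but the
# contents are assumptions, not a dump of a real LibreOffice configuration.
TYPES_BY_EXTENSION = {
    "txt": ["writer_T602_Document", "writer_Text_encoded", "writer_Text"],
    "pdf": ["pdf_Portable_Document_Format"],
}

# (filter name, document service it exports from, file type it writes)
EXPORT_FILTERS = [
    ("writer_pdf_Export", "com.sun.star.text.TextDocument", "pdf_Portable_Document_Format"),
    ("Text (encoded)", "com.sun.star.text.TextDocument", "writer_Text_encoded"),
    ("Text", "com.sun.star.text.TextDocument", "writer_Text"),
    # No entry writes writer_T602_Document: assumed here to have no export filter.
]


def find_export_filter(import_type, extension, filter_name=None):
    """Return the name of a usable export filter, or raise RuntimeError."""
    if filter_name is not None:
        # An explicit filter bypasses the extension lookup entirely.
        for name, source, _target in EXPORT_FILTERS:
            if name == filter_name and source == import_type:
                return name
        raise RuntimeError(f"Unknown export filter '{filter_name}'")

    candidates = TYPES_BY_EXTENSION.get(extension)
    if not candidates:
        raise RuntimeError(
            f"Unknown export file type, unknown extension '{extension}'"
        )

    # The first registered type wins, even if nothing can export to it.
    export_type = candidates[0]
    for name, source, target in EXPORT_FILTERS:
        if source == import_type and target == export_type:
            return name
    raise RuntimeError(
        f"Could not find an export filter from {import_type} to {export_type}"
    )
```

With these assumed tables, looking up "txt" without a filter name raises the same "Could not find an export filter" message as above, while passing filter_name="Text (encoded)" succeeds.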

varna9000 commented 1 year ago

OK, with unoconvert --filter 'Text (encoded)' 65726.docx test.txt the file converted fine :) It appears automatic filter detection doesn't work for encoded text output.

Edit: it also converts fine with the Text-only filter, but I have to pass the filter explicitly.
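
For anyone scripting this, a small Python wrapper around the working command might look like the sketch below. It assumes unoconvert is on the PATH and that a unoserver instance is already running; build_unoconvert_cmd and convert are hypothetical helper names, and the only flag used is the --filter option discussed in this thread.

```python
import subprocess


def build_unoconvert_cmd(infile, outfile, filter_name=None):
    """Build the argument list for unoconvert, naming the filter if given."""
    cmd = ["unoconvert"]
    if filter_name is not None:
        # Passing the filter explicitly sidesteps the extension-based lookup.
        cmd += ["--filter", filter_name]
    cmd += [infile, outfile]
    return cmd


def convert(infile, outfile, filter_name=None):
    """Run unoconvert; raises subprocess.CalledProcessError if it fails."""
    subprocess.run(build_unoconvert_cmd(infile, outfile, filter_name), check=True)


# Example (needs a running unoserver):
# convert("65726.docx", "test.txt", filter_name="Text (encoded)")
```

Passing the arguments as a list avoids shell quoting problems with the space and parentheses in the filter name.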

> In your configuration the "txt" file ending is attached to two different file types, and it chooses the writer_T602_Document by default. You can probably change those extension attachments as well.

Ah, ok. This might be the problem. God knows what libreoffice has set as default.