joniles / mpxj

Primary repository for MPXJ library
http://www.mpxj.org/
GNU Lesser General Public License v2.1

MS Word document resulting from RTFEmbeddedObject.getData() byte array cannot be opened #118

Open FabioRNT opened 5 years ago

FabioRNT commented 5 years ago

Hello, I'm trying to extract an MS Word file embedded in an RTF file by using RTFEmbeddedObject.getEmbeddedObjects(String file). The method returns a list with four instances, which is expected. When I check the resulting data array with Apache Tika, it returns the application/x-tika-msoffice mime type, which seems correct.

However, when I try to open the resulting file, MS Word doesn't display the expected content. I'm attaching both files to this issue.

Here's the code that I'm using:

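```java
// readLineByLine is my own helper; it returns the RTF text as a String.
List<List<RTFEmbeddedObject>> rtfl = RTFEmbeddedObject.getEmbeddedObjects(readLineByLine(file));

for (List<RTFEmbeddedObject> l : rtfl) {
    // The second entry of each inner list is the one written out and inspected.
    byte[] data = l.get(1).getData();

    // Every iteration writes to the same test.doc, so only the last result survives.
    FileUtils.writeByteArrayToFile(new File("test.doc"), data);

    Tika t = new Tika();
    String s = t.detect(data);
    System.out.println("Mimetype: " + s);
}
```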
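As far as I know, application/x-tika-msoffice is the generic type Tika reports for an OLE2 compound file that it cannot classify further, so it can be worth looking at the first bytes of the extracted array directly. The sketch below is not part of MPXJ or Tika; the class and method names are hypothetical. It checks for the eight-byte OLE2 signature and formats a short hex prefix for inspection.

```java
import java.util.Arrays;

// Hypothetical helper, not part of MPXJ or Tika.
final class Ole2Signature {
    // The eight bytes that open an OLE2 compound file.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True when the array begins with the OLE2 signature.
    static boolean startsWithOle2Magic(byte[] data) {
        return data.length >= MAGIC.length
            && Arrays.equals(Arrays.copyOf(data, MAGIC.length), MAGIC);
    }

    // Formats up to 'count' leading bytes as space-separated upper-case hex.
    static String hexPrefix(byte[] data, int count) {
        StringBuilder sb = new StringBuilder();
        int limit = Math.min(count, data.length);
        for (int i = 0; i < limit; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", data[i] & 0xFF));
        }
        return sb.toString();
    }
}
```

For example, `Ole2Signature.hexPrefix(l.get(1).getData(), 16)` would show what the extracted array starts with.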

Attachments at: rtfword.zip

Thanks in advance!

joniles commented 5 years ago

Just to confirm, is the test.doc included in the zip file the original file which was embedded in the RTF, or one you have extracted yourself?

joniles commented 5 years ago

Also... if possible could you include the MPP file that the RTF came from?

FabioRNT commented 5 years ago

Hello, the test.doc file is the one that I extracted using the library. About the RTF, it wasn't from an MPP file. It was from a database that exported OLE objects for me, and I've been able to convert them to RTF and access them as embedded objects.

joniles commented 5 years ago

Thanks for the update. Do you have a way to get the original OLE object out of the database without going through the RTF export exercise you describe? I'd like to start with a "known good" file which MS Word can open, then compare that to what we're able to extract from the RTF.
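The comparison could be sketched roughly as follows: read both files into byte arrays and report the first offset at which they differ. This is only an illustration, not part of MPXJ; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of MPXJ.
final class ByteCompare {
    // Returns the offset of the first differing byte, or -1 when the arrays
    // are identical. If one array is a prefix of the other, the length of
    // the shorter array is returned.
    static int firstDifference(byte[] expected, byte[] actual) {
        int common = Math.min(expected.length, actual.length);
        for (int i = 0; i < common; i++) {
            if (expected[i] != actual[i]) {
                return i;
            }
        }
        return expected.length == actual.length ? -1 : common;
    }
}
```

The two arrays could be loaded with `java.nio.file.Files.readAllBytes` from the known-good file and the extracted file before calling `ByteCompare.firstDifference`.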

FabioRNT commented 5 years ago

I'll upload an original OLE file, but MS Word can't open it directly. To be able to open it, I have to add a header and convert it to RTF. ole.zip
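One thing that might be worth checking on a file like this is whether an OLE2 compound file starts at some offset inside it rather than at byte 0. That is an assumption about how the database wraps the object, not something confirmed for the attached files. The sketch below uses hypothetical class and method names: it searches for the OLE2 signature and returns the bytes from that point on.

```java
import java.util.Arrays;

// Hypothetical helper; assumes the wrapped data may contain an OLE2 file
// preceded by some other header bytes.
final class Ole2Locator {
    // The eight bytes that open an OLE2 compound file.
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Returns the offset of the first OLE2 signature, or -1 if none is found.
    static int indexOfMagic(byte[] data) {
        for (int i = 0; i + MAGIC.length <= data.length; i++) {
            boolean match = true;
            for (int j = 0; j < MAGIC.length; j++) {
                if (data[i + j] != MAGIC[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;
            }
        }
        return -1;
    }

    // Returns a copy of the data starting at the signature, or null if the
    // signature does not occur.
    static byte[] fromMagic(byte[] data) {
        int offset = indexOfMagic(data);
        return offset < 0 ? null : Arrays.copyOfRange(data, offset, data.length);
    }
}
```

A non-zero offset would suggest that extra bytes precede the compound file; whether trimming them yields a document Word accepts would still need to be tested.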