research-technologies / leaf_addons


500 error in FCREPO when attempting to add extracted_text to an existing object #11

Open ghost opened 7 years ago

ghost commented 7 years ago

When updating the extracted text, I'm seeing the following error. It happens when the original file is TXT. I have also observed it when adding to a PDF original_file, but that case works more often than not.

There is no problem when I add the file directly into the 'files' container via the GUI or via curl (as expected).

This is the code:

local_file = Hydra::Derivatives::IoDecorator.new(File.open(path, "rb"))
local_file.original_name = path.split('/').last
local_file.mime_type = Hydra::Works::DetermineMimeType.call(local_file, local_file.original_name)
Hydra::Works::AddFileToFileSet.call(fileset,
                                    local_file,
                                    type.to_sym,
                                    versioning: false)

It works via this very manual method:

I've ruled out the cause being the file itself, and also that it's a Fedora problem proper.

I can add this file to one fileset but not the other, which hints at something to do with the original_file in the failing fileset, which is a plain text file. It seems to be something that AF / LDP are trying to do, but what I'm struggling to figure out is WHAT AF / LDP are doing ... I get nothing in the logs for any of this.
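For anyone wanting to repeat the direct add from Ruby rather than curl, here is a minimal sketch that bypasses AF entirely and PUTs the binary straight under the fileset's 'files' container. The base URL, the fileset path, the `extracted_text.txt` slug, and both helper names (`build_direct_put`, `send_direct_put`) are assumptions for illustration, not anything taken from leaf_addons or Hyku. The client-side `read_timeout` only controls how long Ruby waits for the response; it is not known to be related to the timeout Fedora reports.

```ruby
require "net/http"
require "uri"

# Build a PUT request that stores a binary directly under a FileSet's
# 'files' container. All arguments are hypothetical example values.
def build_direct_put(base_url, fileset_path, slug, body, mime_type)
  uri = URI("#{base_url.chomp('/')}/#{fileset_path}/files/#{slug}")
  request = Net::HTTP::Put.new(uri)
  request["Content-Type"] = mime_type # Fedora stores this as the binary's mime type
  request.body = body
  [uri, request]
end

# Send the request, waiting up to read_timeout seconds for the response.
def send_direct_put(uri, request, read_timeout: 60)
  Net::HTTP.start(uri.host, uri.port,
                  use_ssl: uri.scheme == "https",
                  read_timeout: read_timeout) do |http|
    http.request(request)
  end
end
```

Something like `uri, request = build_direct_put("http://localhost:8984/rest", fileset_path, "extracted_text.txt", File.binread(path), "text/plain")` followed by `send_direct_put(uri, request)` would then show whether the same bytes go in cleanly without AF / LDP in the way.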

In FCRepo logs:

INFO 15:06:37.142 (FedoraLdp) PUT resource 'dev/6d/9c/fd/d2/6d9cfdd2-a775-4eb7-9575-f53f816f33f0/files/24eee720-d0a4-4ab3-a345-4d32bcdba79d'
DEBUG 15:06:37.142 (FedoraBinaryImpl) Created content node at path: /dev/6d/9c/fd/d2/6d9cfdd2-a775-4eb7-9575-f53f816f33f0/files/24eee720-d0a4-4ab3-a345-4d32bcdba79d/jcr:content
ERROR 15:06:57.163 (RepositoryExceptionMapper) Caught a repository exception: java.net.SocketTimeoutException: Read timed out
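
The timestamps in those three lines are worth a second look. A small Ruby sketch (the helper names `seconds_of_day` and `gap_seconds` are made up for this example) that works out how long passed between the PUT being logged and the exception:

```ruby
# Convert an "HH:MM:SS.mmm" log timestamp into exact seconds since midnight.
def seconds_of_day(stamp)
  hours, minutes, seconds = stamp.split(":")
  hours.to_i * 3600 + minutes.to_i * 60 + Rational(seconds)
end

# Elapsed seconds between two timestamps from the same day.
def gap_seconds(from, to)
  (seconds_of_day(to) - seconds_of_day(from)).to_f
end

put_logged   = "15:06:37.142" # (FedoraLdp) PUT resource
error_logged = "15:06:57.163" # (RepositoryExceptionMapper) Read timed out
puts gap_seconds(put_logged, error_logged) # => 20.021
```

A gap of just over 20 seconds would be consistent with a read timeout of roughly 20 seconds firing while Fedora was still waiting for the binary content, but nothing in the logs says which timeout setting is involved, so treat that as a guess.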

In Hyku logs:

Ldp::HttpError: STATUS: 500 org.modeshape.jcr.value.binary.BinaryStoreException: java.net.SocketTimeoutException: Read timed out
    at org.modeshape.jcr.value.binary.FileSystemBinaryStore.storeValue(FileSystemBinaryStore.java:128)
    at org.modeshape.jcr.value.binary.AbstractBinaryStore.storeValue(AbstractBinaryStore.java:251)
    at org.modeshape.jcr.value.binary.BinaryStoreValueFactory.create(BinaryStoreValueFactory.java:257)
    at org.modeshape.jcr.value.binary.BinaryStoreValueFactory.create(BinaryStoreValueFactory.java:49)
    at org.modeshape.jcr.JcrValueFactory.createBinary(JcrValueFactory.java:149)
    at org.modeshape.jcr.JcrValueFactory.createBinary(JcrValueFactory.java:41)
    at org.fcrepo.kernel.modeshape.FedoraBinaryImpl.setContent(FedoraBinaryImpl.java:178)
    at org.fcrepo.http.api.ContentExposingResource.replaceResourceBinaryWithStream(ContentExposingResource.java:612)
    at org.fcrepo.http.api.FedoraLdp.createOrReplaceObjectRdf(FedoraLdp.java:361)

See samvera-tech post

ghost commented 6 years ago

In the end I decided to create separate files rather than add my own 'extracted_text', but this problem was never solved.

whikloj commented 5 years ago

Hey @geekscruff,

I'm looking at this issue from the Fedora side and (trying) to set up a Hyku box to test. Can you give me any details about the size and structure of the object, and whether your system was under load when this happened? Also, did it happen consistently with the above-mentioned object?

ghost commented 5 years ago

Hello @whikloj ... this was all done on a dev instance. I was running a migration which added a PDF and a TXT file to a Hyrax/Hyku work, and then added an 'extracted text' to the TXT, so another file into the 'file set'. It failed consistently on that, but would add the extracted text to the (larger) PDF.

whikloj commented 5 years ago

So, just so I've got this clear in my mind: the PDF is your pcdm:Object with a pcdm:FileSet containing the TXT text (which is provided), but then you extract the text from the PDF and add that to the same pcdm:FileSet?

My understanding of the Samvera content model is weak, so correct me where I am wrong.

pcdm:Object (the PDF) -> pcdm:FileSet -> pcdm:File (PDF)
                       ↳ pcdm:FileSet -> pcdm:File (TXT)
                                       ↳ pcdm:File (extracted text) [ this causes the boom ]

ghost commented 5 years ago

Yes, and ...

pcdm:Object (the PDF) -> pcdm:FileSet -> pcdm:File (PDF)
                                       ↳ pcdm:File (extracted text) [ no boom ]
                       ↳ pcdm:FileSet -> pcdm:File (TXT)

In the end I dispensed with the extra TXT and added extracted text to the PDF FileSet.