kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

HTTP request gives 'Access-Control-Allow-Origin' error #225

rawsh closed this issue 7 years ago

rawsh commented 7 years ago

Sending the request with curl,

curl -v --form input=@./TestPDF.pdf localhost:8080/processHeaderDocument

works, but the equivalent POST request from JavaScript, for example

var data = new FormData();
data.append("input", "@./TestPDF.pdf ");

var xhr = new XMLHttpRequest();
xhr.withCredentials = true;

xhr.addEventListener("readystatechange", function () {
  if (this.readyState === 4) {
    console.log(this.responseText);
  }
});

xhr.open("POST", "http://localhost:8080/processHeaderDocument");

xhr.send(data);

fails with

XMLHttpRequest cannot load http://localhost:8080/processHeaderDocument. Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:42781' is therefore not allowed access.

I've read that this can be fixed by having the server send a header such as 'Access-Control-Allow-Origin: clientside.com'. Is there a config file or similar where I can add the URL that should be allowed to make POSTs?
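
For readers following along: the error appears because the page (http://localhost:42781) and the Grobid service (http://localhost:8080) are different origins. A minimal sketch of that comparison, assuming only the standard `URL` class; the helper name `isCrossOrigin` is made up for illustration:

```javascript
// Two URLs share an origin only if scheme, host, and port all match.
// isCrossOrigin is a hypothetical helper, not part of Grobid or any library.
function isCrossOrigin(pageUrl, targetUrl) {
  return new URL(pageUrl).origin !== new URL(targetUrl).origin;
}

// Same host, different port: the browser treats this as cross-origin.
console.log(
  isCrossOrigin(
    "http://localhost:42781/",
    "http://localhost:8080/processHeaderDocument"
  )
); // true
```

curl is not subject to this check, which is why the same request works from the command line.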

kermitt2 commented 7 years ago

Hello,

Thank you for pointing this out to us. I think we can indeed allow CORS for all the Grobid web services by default, assuming that these are "read-only" services and the main usage would be public or internal pipelines not open to extranet.

If any well-informed people are reading this: are there general security issues with allowing CORS for all services that we should be aware of?

Commit 819b9b0424022859c1317da0e5e8d73d79707209 added CORS. @rawsh would it be possible for you to test whether cross-domain Ajax requests now work properly? Many thanks.

rawsh commented 7 years ago

Thank you for the quick response!

@kermitt2 it looks like I'm still getting the same error. I git cloned the latest repo and that commit is in the history. Here is a screenshot, same js code:

[screenshot from 2017-08-26 16-11-31]

kermitt2 commented 7 years ago

@rawsh thanks! You don't need Access-Control-Allow-Origin: * in your request (that header belongs in the server response). For dataType, since the result is XML, try xhr.setRequestHeader("dataType", "text"); (which should work better than xhr.setRequestHeader("dataType", "text/xml");). I also think you should not set xhr.withCredentials = true, because there is no authentication mechanism for the cross-site request.

Now I don't know if these changes will make the request work :D

I've tested on my machine and could send a request from localhost:8000 to the server running on localhost:8080. Different ports on the same host are normally already considered cross-domain.
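
On the request headers discussed above: as far as I understand the CORS rules, any non-safelisted request header (including a custom one such as dataType, or Access-Control-Allow-Origin sent by the client) makes the browser send a preflight OPTIONS request first. Here is a simplified sketch of that rule. The function name `needsPreflight` is hypothetical, and real browsers apply further conditions (upload listeners, header value limits) that are left out here:

```javascript
// Simplified sketch of when a cross-origin request triggers a preflight.
// Not exhaustive; see the Fetch standard for the full rules.
const SIMPLE_METHODS = new Set(["GET", "HEAD", "POST"]);
const SAFELISTED_HEADERS = new Set([
  "accept",
  "accept-language",
  "content-language",
  "content-type",
]);
const SIMPLE_CONTENT_TYPES = new Set([
  "application/x-www-form-urlencoded",
  "multipart/form-data",
  "text/plain",
]);

// headers is a plain object of author-set request headers, e.g. { dataType: "text" }.
function needsPreflight(method, headers) {
  if (!SIMPLE_METHODS.has(method.toUpperCase())) return true;
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (!SAFELISTED_HEADERS.has(lower)) return true;
    if (lower === "content-type") {
      // Ignore parameters such as the multipart boundary.
      const mime = String(value).split(";")[0].trim().toLowerCase();
      if (!SIMPLE_CONTENT_TYPES.has(mime)) return true;
    }
  }
  return false;
}

console.log(needsPreflight("POST", {})); // false
console.log(needsPreflight("POST", { dataType: "text" })); // true
```

Under this reading, a plain FormData POST with no extra headers would not need a preflight at all, so dropping the custom headers may be the simplest option.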

rawsh commented 7 years ago

@kermitt2 Removing the things you mentioned changes the behaviour: I still get the error about not passing the access control check, but now I also get a 500 error from the server, while the same request from curl works (I tried importing the curl command into Postman and got the same thing).

Curl command that works:

curl -v --form input=@/home/robert/Documents/Belmont/pdf-summarizer/examplepdfs/1.pdf localhost:8080/processHeaderDocument

JS that fails:

var data = new FormData();
data.append("input", "/home/robert/Documents/Belmont/pdf-summarizer/examplepdfs/1.pdf");

var xhr = new XMLHttpRequest();

xhr.addEventListener("readystatechange", function () {
  if (this.readyState === 4) {
    console.log(this.responseText);
  }
});

xhr.open("POST", "http://localhost:8080/processHeaderDocument");
xhr.setRequestHeader("dataType", "text");
xhr.send(data);

Here is the output from grobid

Would you mind sending me the js code you used so I can test?


As for safety issues, CORS (at least with Access-Control-Allow-Origin: *) lets any site post to the server, so someone could use it without permission. I think being able to edit the Access-Control-Allow-Origin header and add domains (or just turn CORS on/off) would be nice.
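
To make the allow-list idea concrete, here is a rough sketch of how a server could pick its CORS response headers from a configured list of origins. This is not Grobid's implementation; `corsHeadersFor` and the list format are assumptions made up for illustration:

```javascript
// Hypothetical helper, not part of Grobid: choose CORS response headers
// for a request based on a configured list of allowed origins.
function corsHeadersFor(requestOrigin, allowedOrigins) {
  // A "*" entry means every origin is allowed.
  if (allowedOrigins.includes("*")) {
    return { "Access-Control-Allow-Origin": "*" };
  }
  // Otherwise echo the origin back only if it is on the list.
  if (requestOrigin && allowedOrigins.includes(requestOrigin)) {
    return {
      "Access-Control-Allow-Origin": requestOrigin,
      // Tell caches that the response depends on the Origin header.
      "Vary": "Origin",
    };
  }
  // No CORS headers: browsers will block scripts on that origin from reading the response.
  return {};
}

console.log(corsHeadersFor("http://localhost:42781", ["http://localhost:42781"]));
console.log(corsHeadersFor("http://evil.example", ["http://localhost:42781"]));
```

Note that this only restricts browsers. Non-browser clients such as curl ignore CORS, so it is not a substitute for authentication.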

kermitt2 commented 7 years ago

OK, so the good news is that the Ajax CORS request is working!

There's a problem in the way you build your Ajax query. I think the PDF should be passed like this:

formData.append("input", "file:///home/robert/Documents/Belmont/pdf-summarizer/examplepdfs/1.pdf");

However, browsers normally will not allow a request that reads from the local file system (Chrome certainly will not, Firefox maybe). You could do it with node.js, but in a browser it would be considered a (major) security flaw. You can modify the browser settings, but that's dangerous.

The JavaScript that I am using to test this is the console application in grobid-service, see grobid/grobid-service/src/main/webapp/grobid/grobid.js, in particular lines 594-623, which build the Ajax query for this service. As you can see, I build the FormData object directly from the HTML form, so there is no security problem.
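
A small sketch of the difference, for anyone hitting the same 500 error: a string appended to FormData is sent as a plain text field containing the path, whereas a Blob or File is sent as a file part, which is what curl's @ prefix produces. The path and the four placeholder bytes below are made up for illustration and are not a valid PDF:

```javascript
// A string value becomes a text field: the server receives the path text, not the PDF.
const asText = new FormData();
asText.append("input", "/path/to/1.pdf"); // hypothetical path

// A Blob (or a File taken from <input type="file">) becomes a file part.
const placeholderBytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF" only
const asFile = new FormData();
asFile.append(
  "input",
  new Blob([placeholderBytes], { type: "application/pdf" }),
  "1.pdf"
);

console.log(typeof asText.get("input")); // "string"
console.log(asFile.get("input") instanceof Blob); // true
```

Building the FormData from the form element, as described above, gives the second kind of entry automatically because the browser supplies the File the user selected.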

rawsh commented 7 years ago

Thank you for all the help! I finally got it working with

<form id="pdfform" onsubmit="return sendPost();">
  <input type="file" name="input" accept=".pdf,application/pdf">
  <input type="submit">
</form>

<script>
function sendPost() {
    var form = document.getElementById('pdfform');
    var formData = new FormData(form);

    var xhr = new XMLHttpRequest();
    var url = "http://localhost:8080/processHeaderDocument";

    xhr.responseType = 'text';
    xhr.open('POST', url, true);

    xhr.onreadystatechange = function(e) {
        // Wait until the request has finished before inspecting the status.
        if (xhr.readyState !== 4) {
            return;
        }
        if (xhr.status === 200) {
            console.log(e.target.response);
        } else {
            console.log(xhr);
        }
    };

    xhr.send(formData);

    return false;
}
</script>

I think an example in the docs would be awesome so other devs can avoid my pain :sweat_smile:

rawsh commented 7 years ago

Huh. The processFulltextAssetDocument path gives me the XHR error while the other ones do not. I'm reading the data with a blob and jspdf. @kermitt2 do you know what's going on?

EDIT: looks like I just needed to add .header("Access-Control-Allow-Origin", "*").header("Access-Control-Allow-Methods", "GET, POST, DELETE, PUT") to the responseAsset function in the process file Java code

kermitt2 commented 7 years ago

Yes, I didn't update this service because I plan to remove it later in September. It causes serious problems for some crazy PDFs: for instance, I had one PDF of around 10 pages with more than 40 000 embedded images, because it contained one embedded bitmap file for each line of a picture. This can be quite common for some publishers, and in practice it can bring the server down.

So I don't recommend using it. It will be replaced by another service that works with crops rather than the embedded images, which ensures we don't get an explosion of asset files.

rawsh commented 7 years ago

@kermitt2 understood, thanks. Looking forward to the new image service.