irods / irods_client_http_api

An iRODS C++ HTTP API using Boost.Beast
BSD 3-Clause "New" or "Revised" License
0 stars 6 forks source link

Curl fails to upload some files, but succeeds with other files #273

Open korydraughn opened 2 months ago

korydraughn commented 2 months ago

Bug Report

First reported at the following:

Initial investigation points to the parser used for multipart/form-data. Something makes the parser believe it has reached the end of the stream. Or maybe curl isn't sending the full payload?

[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Header: (Host, localhost:9000)
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Header: (User-Agent, curl/7.68.0)
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Header: (Accept, */*)
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Header: (Authorization, Bearer 28b40773-1774-4e53-b811-2c2077f719fc)
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Header: (Content-Length, 112037)
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Header: (Content-Type, multipart/form-data; boundary=------------------------63f21c0b0de45459)
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Method: POST
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Version: 11
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Target: /irods-http-api/0.2.0/data-objects
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Keep Alive: true
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Has Content Length: true
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Chunked: false
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] on_read: Needs EOF: false
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] PARAM = (name=boundary, value=------------------------63f21c0b0de45459)
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] PARAM = (name=name, value=bytes)
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] END OF HEADERS
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] parse_multipart_form_data: CAPTURED CONTENT SIZE = [17984]
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] END OF CONTENT
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] PARAM = (name=name, value=op)
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] END OF HEADERS
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] parse_multipart_form_data: CAPTURED CONTENT SIZE = [5]
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] END OF CONTENT
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] PARAM = (name=name, value=lpath)
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] END OF HEADERS
[2024-04-23 23:59:28.156] [P:1] [debug] [T:30] parse_multipart_form_data: CAPTURED CONTENT SIZE = [29]
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] END OF CONTENT
[2024-04-23 23:59:28.156] [P:1] [trace] [T:30] END OF REQUEST
ll4strw commented 2 months ago

Hello, I am wondering whether the configuration parameter

 // The maximum size allowed for the body of a request.
            "max_size_of_request_body_in_bytes"

could play a role in this?

Also, what does truncate (defaults to true?) do in the example below taken from the docs?

curl http://localhost:<port>/irods-http-api/<version>/data-objects \
    -H 'Authorization: Bearer <token>' \
    [-F,--data-urlencode] 'op=write' \
    [-F,--data-urlencode] 'lpath=<string>' \ # Absolute logical path to a data object.
    [-F,--data-urlencode] 'resource=<string>' \ # The root resource to write to. Optional.
    [-F,--data-urlencode] 'offset=<integer>' \ # Number of bytes to skip. Defaults to 0. Optional.
    [-F,--data-urlencode] 'truncate=<integer>' \ # 0 or 1. Defaults to 1. Truncates the data object before writing. Optional.
    [-F,--data-urlencode] 'append=<integer>' \ # 0 or 1. Defaults to 0. Appends the bytes to the data object. Optional.
    [-F,--data-urlencode] 'bytes=<binary_data>;type=application/octet-stream' \ # The bytes to write

Many thanks in advance.

trel commented 2 months ago

Default of truncate=1 just means that this transfer should clear/overwrite the existing file, if there is already one in place at the target logical path.

I'm expecting we'll find a null byte or something in these files and we're not handling that case as we should. The same file being uploaded more than once and getting the identical incorrect filesize suggests it's not random / memory layout.

ll4strw commented 2 months ago

Glancing at the code https://github.com/irods/irods_client_http_api/blob/48b63758b056e11008a6b1e876f0ed427a3f39c2/core/src/crlf_parser.cpp#L124-L129 is binary content just treated as ASCII? For me the reported problem disappears if I actually transmit a hex dump of any arbitrary binary attachment and then I reverse the the dump to its original binary form with xdd -r

xxd PatMartino_Activity_I.jpg | curl http://localhost:9000/irods-http-api/0.3.0/data-objects -H "Authorization: Bearer $(oidc-token irods_http_api)" -F 'op=write' -F "lpath=/laptopZone/home/rods/PMI.jpg" -F bytes=@- 
{"irods_response":{"status_code":0}}

The command above preserves the attachment size and the original content can be retrieved with xxd. Chances are then that if a binary content is treated as such, maybe the reported problem could be solved?

korydraughn commented 2 months ago

The parser may be interpreting a null byte during string construction.

This will be resolved in the next release, which won't happen until after UGM.

korydraughn commented 2 months ago

After some digging, it appears the parser is incorrectly handling data with embedded CRLFs.

I've modified the parser and the change seems to resolve the issue (assuming I reproduced the same bug). See the following branch for the changes.

Please give it a try and let us know if the problem persists.

The real solution is to rewrite the parser, which will happen for 0.4.0.

ll4strw commented 1 month ago

Thanks for looking into this, but it still does not work for me. Here follows more info. The test file is a jpg image

$ ls -l PatMartino_Activity_I.jpg 
-rw-rw-r-- 1 leonardo leonardo 92866 jan 23 15:42 PatMartino_Activity_I.jpg
$ file PatMartino_Activity_I.jpg
PatMartino_Activity_I.jpg: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 904x1298, components 3

which I send to the iRODS server via

curl -v http://localhost:9000/irods-http-api/0.3.0/data-objects -H "Authorization: Bearer $(oidc-token irods_http_api)" -F 'bytes=@PatMartino_Activity_I.jpg;type=application/octet-stream'  -F 'op=write' -F 'lpath=/laptopZone/home/rods/PMI.jpg'

*   Trying 127.0.0.1:9000...
* Connected to localhost (127.0.0.1) port 9000 (#0)
> POST /irods-http-api/0.3.0/data-objects HTTP/1.1
> Host: localhost:9000
> User-Agent: curl/7.81.0
> Accept: */*
> Authorization: Bearer XYXYX
> Content-Length: 93303
> Content-Type: multipart/form-data; boundary=------------------------31ab4593999fb662
> 
* We are completely uploaded and fine
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: irods_http_api/0.3.0 (48b63758b056e11008a6b1e876f0ed427a3f39c2)
< Content-Type: application/json
< Content-Length: 36
< 
* Connection #0 to host localhost left intact
{"irods_response":{"status_code":0}}

On the iRODS server I see a "truncated" image

ils -l PMI.jpg
  rods              0 laptopResc        23727 2024-05-13.10:22 & PMI.jpg

which cannot be displayed because of a Premature end of JPEG file.

The API server logs report

[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Header: (Content-Length, 93303)
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Header: (Content-Type, multipart/form-data; boundary=------------------------31ab4593999fb662)
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Method: POST
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Version: 11
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Target: /irods-http-api/0.3.0/data-objects
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Keep Alive: true
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Has Content Length: true
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Chunked: false
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] on_read: Needs EOF: false
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] PARAM = (name=boundary, value=------------------------31ab4593999fb662)
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] PARAM = (name=name, value=bytes)
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] END OF HEADERS
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] parse_multipart_form_data: CAPTURED CONTENT SIZE = [23727]
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] END OF CONTENT
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] PARAM = (name=name, value=op)
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] END OF HEADERS
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] parse_multipart_form_data: CAPTURED CONTENT SIZE = [5]
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] END OF CONTENT
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] PARAM = (name=name, value=lpath)
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] END OF HEADERS
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] parse_multipart_form_data: CAPTURED CONTENT SIZE = [29]
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] END OF CONTENT
[2024-05-13 10:22:23.627] [P:21158] [trace] [T:21192] END OF REQUEST
[2024-05-13 10:22:23.627] [P:21158] [debug] [T:21192] resolve_client_identity: Authorization value: XYXY
[2024-05-13 10:22:23.636] [P:21158] [debug] [T:21192] hit_introspection_endpoint: Received the following response:  zzzzzz
[2024-05-13 10:22:23.636] [P:21158] [info] [T:21186] op_write: client_info.username = [rods]
[2024-05-13 10:22:23.636] [P:21158] [trace] [T:21186] op_write: Opening data object [/laptopZone/home/rods/PMI.jpg] for write.
[2024-05-13 10:22:23.636] [P:21158] [trace] [T:21186] op_write: (write) Initializing for single buffer write.
[2024-05-13 10:22:23.637] [P:21158] [trace] [T:21186] get_connection: Changing identity associated with connection to [rods].
[2024-05-13 10:22:23.637] [P:21158] [trace] [T:21186] get_connection: Successfully changed identity associated with connection to [rods].
[2024-05-13 10:22:23.681] [P:21158] [debug] [T:21186] op_write: Write buffer: remaining=[23727], sending=[8192].
[2024-05-13 10:22:23.682] [P:21158] [debug] [T:21186] op_write: Write buffer: remaining=[15535], sending=[8192].
[2024-05-13 10:22:23.682] [P:21158] [debug] [T:21186] op_write: Write buffer: remaining=[7343], sending=[7343].

Finally, I tried the API with a different image, this time in png format, and got the same problem. Are binary streams actually opened as binary in the API? Thanks, L.

korydraughn commented 1 month ago

Hmm, that's surprising.

Please share/attach the image files you used to test the code change. If you have text/binary files which also demonstrate the issue, please attach those too.

korydraughn commented 1 month ago

Finally, I tried the API with a different image, this time in png format, and got the same problem. Are binary streams actually opened as binary in the API? Thanks, L.

That depends on the underlying filesystem being used. If we're talking about a normal POSIX filesystem, then essentially, all files are opened in binary mode (because POSIX doesn't have the concept of text/binary mode for the open API).

To answer your question, kind of. Text/Binary mode can be ignored for iRODS streams interacting with POSIX filesystems.

ll4strw commented 1 month ago

Apologies, give me a moment please...I used a completely wrong commit. I will report the results in a few minutes.

ll4strw commented 1 month ago

Commit 72ef848 works. Thanks for your work and apologies again for the mess up..

korydraughn commented 1 month ago

Excellent. Once UGM is done, we'll look into an improved implementation.

It would be great if you can attach the files which caused the error. I'd like to include them as part of our testing. If the files are huge, then don't worry about it.

ll4strw commented 1 month ago

The following png

ul-algemeen-rgb-kleur

and any of the example h5 files here and fits files here. Thanks in advance.