kbase / staging_service

PTV-1882 upload fixes and improvements #189

eapearson opened 1 year ago

eapearson commented 1 year ago

PR Description

Main bug fixes

The primary motivation for this set of changes was, initially, to ensure that the service could handle uploads of files larger than 5GB, as there were reports that it could not.

Crash reading file for md5 calculation

Indeed, it was found that the code that generates the metadata for a given file read the entire file into memory in order to compute the md5. This would crash the interpreter at file sizes between 4.0GB and 4.4GB. The solution was to read the file in "chunks", i.e. through a fixed-size buffer, and compute the md5 cumulatively.

(This code was also rather cursed in that it used a synchronous read() and would therefore block the entire server until completion.)
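
For illustration, a minimal sketch of the cumulative approach (the buffer size and function name here are illustrative, not necessarily what the PR uses):

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # read in 1MB chunks rather than the whole file

def file_md5(path: str) -> str:
    """Compute an md5 incrementally so the file is never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()
```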

Crash counting lines

After fixing this, a secondary issue was discovered in the code which computes the total number of "lines" in a text file. It would loop through the file, reading one line at a time and counting them. The bug is that in some admittedly rare cases, but nonetheless encountered with test data, a file may appear to be "text" yet contain no line endings. Such a file would suffer the same issue as the md5 bug - it would crash the server.

The solution here was to fold the line counting and the "is it a text or binary file" test into the same loop over the file that calculates the md5, as sketched below. Instead of reading "lines", the count is of line endings encountered in each chunk of the file within the loop.
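
A sketch of the combined single-pass scan (the binary heuristic shown, a null-byte check, is an assumption on my part; the service's actual test may differ):

```python
import hashlib

CHUNK_SIZE = 1024 * 1024

def scan_file(path: str) -> tuple[str, int, bool]:
    """One pass over the file: md5, line-ending count, and a binary check.

    Counting b"\n" per chunk avoids readline(), which buffers an entire
    "line" in memory - and crashes on a huge file with no newlines.
    """
    digest = hashlib.md5()
    line_endings = 0
    is_binary = False
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
            line_endings += chunk.count(b"\n")
            if not is_binary and b"\x00" in chunk:
                is_binary = True
    return digest.hexdigest(), line_endings, is_binary
```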

So those were the two big bugs that would crash the server with large files.

Crash race condition in globus code

Another doozy was found in the globus processing code. With each upload, a check is made to see if a special globus file has been created for a user with a KBase globus link. The issue was revealed by concurrency testing, and resulted in the server crashing.

The cause was a race condition in globus.py: the code first checks whether the user directory exists and, if not, creates it. Another request could create the directory between the check and the creation. This would only be revealed when running parallel requests, not merely concurrent ones, as the code itself is synchronous. The solution was simple: don't check for existence first, just attempt to make the directory using the exist_ok option.
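
In other words, something like this (the function and variable names are illustrative):

```python
import os

def ensure_user_dir(user_dir: str) -> None:
    # Checking os.path.exists() first races with a parallel request that
    # creates the directory between the check and the makedirs() call.
    # Creating unconditionally with exist_ok=True sidesteps the race.
    os.makedirs(user_dir, exist_ok=True)
```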

Although I am generally suspicious of any usage of "check for existence and then take an action that will fail if that assumption becomes invalid", finding this one was pure luck: I happened to run manual tests of concurrent, parallel handling by uploading a large directory with the development upload front end while also removing the data directory before each run.

Data corruption potential

Another issue, not encountered in testing but apparent through code analysis, was the potential for data corruption if more than one file with the same name but different contents is uploaded at the same time. This was not hard to trigger under the right conditions. It may seem like an unnatural case, and perhaps a rare one, but one can imagine a user with many files of the same name in different directories which are uploaded at the same time without the subdirectory being recreated on the server.

The solution was to save to a temporary file first, and only move the file to its final destination after the upload was complete.
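
A minimal sketch of the pattern (the names and the chunked write are illustrative, not the PR's exact code):

```python
import os
import tempfile

def save_upload(dest_path: str, chunks) -> None:
    """Write to a unique temp file, then move it into place.

    Concurrent uploads of the same filename each write to their own
    temp file, so partial writes never interleave at the destination.
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path))
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
        os.replace(tmp_path, dest_path)  # atomic within one filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```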

Other fixes

There was quite a bit of rearrangement and repair of code within the /upload handling code in app.py, as well as in metadata.py. There was a lot of duplicated code, which made it difficult to isolate the behavior. For instance, several locations separately handled the logic of generating file metadata if it was found missing.

The primary focus was to consolidate code.

At the same time, it is very difficult to even inspect Python code when it violates coding standards (PEP8 and others). So each file that was touched was subjected to black formatting, pylint errors were addressed, and pylint exceptions were added where necessary.

Secondary - concurrency

A secondary issue was concurrency. As I worked on the code and exercised the runtime with a test harness front end, I could observe two problematic aspects of the service: concurrency and performance.

The service was running as a single-threaded aiohttp server, with no front end. This meant that any CPU-bound process, like a blocking read operation, would block the entire server for the duration. In practice this would lead to uploads, downloads, or any other api call stalling for the duration, which could be many tens of seconds, up to a minute. With the prospect, as well, of the server crashing (even with automatic container restarting in our deployments), requests to the server could completely fail.

So I added gunicorn as a front end. Gunicorn starts a number of workers, each one an aiohttp server. In my observations, this increases concurrency, performance, and reliability, and adds parallelism. While uploading concurrent files, I could observe that a blocking operation, like copying a file, would normally not block other uploads. Occasionally it would, and I presume that is because the blocking operation was in the same aiohttp worker as the affected uploads. Effective parallelism is enabled because multiple workers can handle uploads at the same time, whereas aiohttp's concurrency is cooperative.
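
For reference, gunicorn's config file is itself Python; a minimal sketch of this arrangement (the bind address and worker count here are illustrative):

```python
# gunicorn.conf.py
# Each worker is an independent aiohttp server process, so a blocking
# operation in one worker does not stall requests handled by the others.
bind = "0.0.0.0:3000"
workers = 4
worker_class = "aiohttp.GunicornWebWorker"
```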

Secondary - performance

As mentioned above, performance was another concern, as the service seemed to process uploads much more slowly than it should. When I switched the upload destination to the container's filesystem, I found that the rate went up from about 20MB/s to 70MB/s, and, after adding in the necessary file copy, to 65MB/s. Of course, this could be an artifact of docker running on macOS ... there are a number of possible factors.

I also increased the read buffer size from the standard 8K to 1M. This also increased performance, but not much above the high water mark.

Shane suggested that I add an environment variable switch to allow either strategy.
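
Something along these lines, where the variable name is hypothetical (not necessarily the one actually added):

```python
import os

# Hypothetical switch: stage uploads on the container's local filesystem
# first, or stream directly to the network-mounted destination.
USE_LOCAL_SCRATCH = (
    os.environ.get("UPLOAD_USE_LOCAL_SCRATCH", "false").lower() == "true"
)
```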

After making it easy to evaluate both strategies, I found that, with the buffer size increase, there is very little difference between the two. I think this is because the number of writes is reduced by 125x, and I suspect that each write to a non-local directory carries significant overhead.

Secondary - code quality fixes

In response to local linting errors, and to code quality, formatting, and security issues reported at GitHub and by a local sonarlint vsc plugin, there were many small fixes made. This was partly exacerbated by some cross-file fixes, which caused those files to be caught by the linters. One such change concerned the ubiquitous "Path" class, which conflicted with the standard library's Path class. In the code, the Python Path was aliased to PyPath or PathPy, but only in files where there was a conflict, and our own Path class was allowed to use that name. That is very confusing, and the opposite of good practice, so it was fixed by letting the Python Path keep its normal name and renaming our Path to StagingPath, which also happens to reflect its scope, as it is not a general-purpose Path class. Even though this was a small change, it forced those files to be reformatted and their linting errors fixed. Extra work for the PR, but good in the end.

Testing

Before testing could begin, there were several broken tests to deal with. Two tests were broken initially, and another broke with my changes. After fixing the tests, I found that coverage did not go above 50%, and, after adding the html report for coverage, found that coverage reporting was broken. E.g., the reports did not correctly highlight missing lines of code.

This required an update to the dependencies, which were indeed quite out of date. After this, test coverage jumped to over 80%, and the html coverage report worked perfectly.

Tests were updated to include newly covered code, although at this point I'm sure a few more tests are needed to cover the changes made since then, a few days ago.

Although some tests do use the aiohttp test server library to simulate server endpoints, and an http client for calling them, I think actual integration tests against a running server would be very useful in the future, especially for exercising scenarios under load, concurrency conditions, etc. within the production Docker image and container.

Developer tools

Finally, I added a directory, developer, containing tools I've used via a Python container. I wanted to run the tests under conditions closer to the server's - so in an equivalent container. I chose a separate Dockerfile because the testing container, which is also used to run other Python scripts, does not need as much support and setup - just the Python dependencies - and uses a different entrypoint mechanism.

eapearson commented 1 year ago

Lots of fixes for code quality tools. The remaining issues (code smells) are all related to test data. The one remaining security issue relates to the image being left to run as root. @bio-boris, opinion? It runs as non-root for me in tests; upload, download, delete, and list work fine, but there is a lot more in the service. I really will work on the PR description next.

bio-boris commented 1 year ago

It's possible that there will be some permissions issues if it is run as non-root, as NFS is mounted in with a specific UID/GID.

bio-boris commented 1 year ago

It's way harder to read this PR when there are so many linting changes. Could you leave comments near where there are actual changes or do the linting in a separate PR?

sonarcloud[bot] commented 1 year ago

Kudos, SonarCloud Quality Gate passed!

Bug: A (0 Bugs)
Vulnerability: A (0 Vulnerabilities)
Security Hotspot: A (0 Security Hotspots)
Code Smell: A (13 Code Smells)

Coverage: no coverage information
Duplication: 0.3%

bio-boris commented 1 year ago

Does this PR need a rebase?