praeclarum / FuGetGallery

An alternative web UI for browsing nuget packages
https://www.fuget.org
MIT License

package download optimization POC #139

Closed MarkPflug closed 3 years ago

MarkPflug commented 3 years ago

This is a proof of concept (very rough) of an optimization to avoid downloading and caching entire packages. It uses the structure of the zip archive and HTTP range requests to only download "slices" of the zip package instead of downloading the whole thing.

It seems that in a lot of these large packages the "bloat" is due to large native binaries that fuget doesn't do anything with. There's no reason to download and cache those; unfortunately, with the current approach everything comes along for the ride.

The bulk of the changes are in PackageData.cs; they replace usage of the ZipArchive class with manual requests for sub-ranges of the package to retrieve just the "table of contents". This requires more HTTP round trips, but the overall data transferred should be tiny by comparison.
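The slicing idea can be sketched as follows. This is a minimal illustration in Python rather than the project's C# (names like `http_range` and the demo package are stand-ins, not FuGet code): two simulated range requests are enough to list a zip's entries, first the 22-byte end-of-central-directory record, then the central directory itself.

```python
import io
import struct
import zipfile

def http_range(blob, start, length):
    """Stand-in for an HTTP GET with a 'Range: bytes=start-...' header."""
    return blob[start:start + length]

def read_central_directory(blob):
    """List a zip's entries using only two 'range requests'."""
    # 1. Fetch the 22-byte End of Central Directory record (assumes no comment).
    eocd = http_range(blob, len(blob) - 22, 22)
    assert eocd[:4] == b"PK\x05\x06", "EOCD signature not found"
    cd_size, cd_offset = struct.unpack("<II", eocd[12:20])
    # 2. Fetch just the central directory and walk its fixed 46-byte headers.
    cd = http_range(blob, cd_offset, cd_size)
    names, pos = [], 0
    while pos < len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
        name_len, extra_len, comment_len = struct.unpack("<HHH", cd[pos + 28:pos + 34])
        names.append(cd[pos + 46:pos + 46 + name_len].decode("ascii"))
        pos += 46 + name_len + extra_len + comment_len
    return names

# Demo: build a tiny fake .nupkg in memory and list it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("Foo.nuspec", "<package/>")
    z.writestr("lib/net45/Foo.dll", b"MZ")
demo_names = read_central_directory(buf.getvalue())
```

Note this sketch inherits the same assumptions discussed below (no archive comment, no ZIP64, ASCII names).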

There are a couple other changes to lazily access streams only when needed. Newtonsoft.Json, for example, carries 9 distinct copies of the dll and xml for the various frameworks it supports. Rather than loading each of those eagerly, I tried to make the access happen on demand.
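The on-demand pattern looks roughly like this (a Python sketch, not the actual C# change; the `fetch` callable is a hypothetical stand-in for a range request):

```python
class LazyEntry:
    """Sketch of on-demand loading: nothing is fetched when the page model is
    built; the bytes come down only on first access and are then kept."""
    def __init__(self, name, fetch):
        self.name = name
        self._fetch = fetch   # zero-argument callable standing in for a range request
        self._data = None

    @property
    def data(self):
        if self._data is None:
            self._data = self._fetch()
        return self._data

# Demo: count how many times the "network" is hit.
calls = []
entry = LazyEntry("lib/net45/Foo.dll", lambda: calls.append(1) or b"MZ")
_ = entry.data
_ = entry.data  # second access reuses the cached bytes
```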

It is worth pointing out that there is currently zero caching happening in this implementation, beyond holding on to the table of entries for each package. Each request for a dll or xml goes over the network. That obviously needs to be fixed.
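The simplest fix would be to memoize the fetched slices. A hypothetical sketch (the function body is a stub, not a real HTTP call):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def fetch_slice(package_url, start, length):
    """Hypothetical memoized range fetch; the body stands in for a real
    ranged HTTP GET against the package URL."""
    return f"{package_url} bytes={start}-{start + length - 1}".encode()

# Repeated requests for the same slice hit the cache, not the network.
a = fetch_slice("https://example.test/newtonsoft.json.nupkg", 0, 22)
b = fetch_slice("https://example.test/newtonsoft.json.nupkg", 0, 22)
```

A real version would likely want size-based eviction and a file-backed tier, since slices vary widely in size.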

Will this fix the OOM issue being discussed in #108? I'm not sure. After caching is applied to this I think it might still be susceptible to LOH fragmentation, but I suspect it will, at the least, be greatly reduced.

There are various to-dos and known issues that need to be considered, but I thought this proof of concept would be a good place to start a discussion. Let me know what you think.

loic-sharma commented 3 years ago

Hey FYI you may be able to leverage the MiniZip project to do this: https://github.com/joelverhagen/minizip

This was created by @joelverhagen from the nuget.org team to analyze the .NET ecosystem, so it's been battle-tested!

joelverhagen commented 3 years ago

Thanks for the heads up @loic-sharma. @MarkPflug, we meet again 😅.

Love the idea. I've worked a lot in the space and have some thoughts.

My primary concern is that we should definitely think about inflating download counts on NuGet.org. Each range request is an individual GET, so this will multiply download count increments. In my own code I've set a user agent that excludes my requests from download counts; the same should be considered here.

I don't know the current user agent used by FuGet, but I hope it is something identifiable that allows the NuGet.org stats pipeline to label this traffic and even consider excluding it from download counts. I think this needs to be an open dialogue: perhaps we do want to see FuGet gallery traffic in stats reports, but I feel we should probably not have the number multiplied by the number of DLLs in the package 😀.

Regarding MiniZip (https://github.com/joelverhagen/MiniZip): right now it does not support compressed files (I only needed to fetch the signature file piecemeal) and it buffers from the end. Both of these things could change, but it's not a drop-in replacement at this point. Happy to take contributions 👍.

Finally, I have done an in-depth analysis of big .nupkg files, their popularity, and their composition (native assemblies or not). It would be worth looking through this data to project how effective this will be in percent byte savings. Right now the data is not in a shareable form, but it could be made so fairly easily.

Is there a list of top package IDs that are OOMing?

MarkPflug commented 3 years ago

Hello again Joel.

Thanks for your insights regarding the impact this might have on NuGet. I hadn't even considered that. It should be easy enough to find a solution that doesn't skew the download data that's collected.

I'm definitely interested in reviewing your MiniZip codebase. There were a few concerns I had about the PK spec that left me worried this might not always work.

The most obvious is section 4.3.16 (end of central directory record). This is the final record in the file, and the final field is a variable-sized field, ".ZIP file comment". Currently, my implementation expects this comment to be zero length, so that this final record is exactly 22 bytes long. I suspect that the number of NuGet packages with a value in this comment field is approximately zero, but I don't know that for sure.

The other issue is that I currently ignore the "Zip64" part of the spec, honestly haven't even read it. I don't know if it would apply to NuGet packages or if it is specific to zip files that exceed 2GB, which I'd assume NuGet wouldn't allow? Is there a size limit for NuGet?

I'm also reading file names using ASCII encoding. The spec (the part that I actually read) wasn't super clear about this, so I just punted for now and figured the vast majority of filenames would be ASCII only. Good enough for a POC.

Anyway, the point is that there are a lot of open questions about what it would take to make this a robust solution. It sounds like your MiniZip project might have some answers.
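The 22-byte assumption can be checked cheaply. A sketch (Python for illustration; `eocd_is_plain` is a hypothetical helper, and the demo uses `zipfile` to fabricate a commented archive):

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"

def eocd_is_plain(tail22):
    """True if the last 22 bytes form a comment-free EOCD record (the POC's
    assumption); the comment length field sits at offset 20 of that record."""
    if tail22[:4] != EOCD_SIG:
        return False  # a trailing comment pushed the EOCD earlier in the file
    (comment_len,) = struct.unpack("<H", tail22[20:22])
    return comment_len == 0

def make_zip(comment=b""):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("Foo.nuspec", "<package/>")
        z.comment = comment
    return buf.getvalue()

plain = eocd_is_plain(make_zip()[-22:])           # comment-free archive
commented = eocd_is_plain(make_zip(b"hi")[-22:])  # archive with a comment
```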

What kinds of validations does nuget perform when packages are uploaded? Is that code open-sourced somewhere? Given that most packages are probably created via dotnet pack I imagine that they are pretty consistent and that odd-ball, hand-crafted zip files are rare.

Is there a list of top package IDs that are OOMing?

Not that I'm aware of. I looked at a handful that were mentioned in the various OOM issues: #138, #108, #48. They all seem to mention large NuGet packages:

- https://www.fuget.org/packages/FluentMigrator.Console
- https://www.fuget.org/packages/AWSSDKCPP-APIGateway
- https://www.fuget.org/packages/GDAL.Native

joelverhagen commented 3 years ago

.ZIP file comment

I had this analysis at some point, but unfortunately I no longer have the data or the results; my memory is very poor in this regard. In my https://github.com/joelverhagen/ExplorePackages project I have all central directories stored in Azure Table storage, so I could find this out pretty easily. For now we should assume there are some ZIP comments: AFAIK there is nothing in the NuGet client or the NuGet.org ingestion pipeline that rejects them. In MiniZip I use an expanding-buffer technique (expanding from the end of the ZIP) to handle the comment issue as well as large central directories.
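The expanding-from-the-end technique can be sketched like this (a Python illustration of the idea, not MiniZip's actual code; `read_tail` stands in for an HTTP range request):

```python
import io
import zipfile

EOCD_SIG = b"PK\x05\x06"

def find_eocd(read_tail, file_size, max_scan=22 + 65535):
    """Locate the EOCD record by reading geometrically growing chunks from the
    end of the file, so a trailing comment doesn't break parsing. A comment is
    at most 65535 bytes, which bounds the scan. read_tail(start, n) -> bytes."""
    chunk = 1024
    while True:
        start = max(0, file_size - chunk)
        idx = read_tail(start, file_size - start).rfind(EOCD_SIG)
        if idx != -1:
            return start + idx  # absolute offset of the EOCD signature
        if start == 0 or chunk >= max_scan:
            raise ValueError("no end of central directory record found")
        chunk *= 2

# Demo: a zip with a 3000-byte comment still parses.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("Foo.nuspec", "<package/>")
    z.comment = b"x" * 3000
blob = buf.getvalue()
pos = find_eocd(lambda s, n: blob[s:s + n], len(blob))
```

A production version would also want to verify the candidate record (e.g. that its comment length reaches exactly to end of file), since the signature bytes can occur inside compressed data.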

Some packages have tens of thousands of files so dealing with big central directories is a legitimate issue, as an aside.

The other issue is that I currently ignore the "Zip64" part of the spec, honestly haven't even read it.

ZIP64 is a complexity nightmare in the ZIP spec if you ask me. Fortunately, we block ZIP64 on NuGet.org due to a package signing requirement. You should be fine there.

Note that in the general ZIP world it is technically possible to have a ZIP64 archive that is smaller than 2 GB. You can make a normal 1-2 MiB ZIP that uses ZIP64 markers and a lot of older, general ZIP readers will behave just fine.

Feel free to look through my MiniZip test data if you want to test this implementation with oddballs: https://github.com/joelverhagen/MiniZip/tree/master/MiniZip.Test/TestData

I'm also reading file names using ASCII encoding.

For a while (since the NuGet 3.x timeframe, IIRC) ZIP file paths have been URL encoded by official pack tooling, which supports your ASCII implementation. But on the NuGet team we are well aware that folks sometimes have less common requirements and perform the pack step on their own, or even use manual tools like 7-Zip during or after the production of the initial .nupkg. So we can't depend on that assumption IMO. This is the approach I use in MiniZip, and it is able to read all file names in all NuGet.org .nupkgs: https://github.com/joelverhagen/MiniZip/blob/21b1e458e38690319d7257cecb4e08de338169cf/MiniZip/ZipEntryExtensions.cs#L50-L62
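Roughly, the decoding rule is the one from the ZIP spec's appendix: UTF-8 when general-purpose bit 11 is set, otherwise code page 437. A minimal sketch (Python, assuming that rule; not a transcription of the linked C#):

```python
def decode_entry_name(raw, flags):
    """Decode a zip entry name per the ZIP spec's language-encoding flag:
    UTF-8 when general-purpose bit 11 (0x0800) is set, else code page 437."""
    return raw.decode("utf-8" if flags & 0x0800 else "cp437")

utf8_name = decode_entry_name("lib/net6.0/Ünïcode.dll".encode("utf-8"), 0x0800)
plain_name = decode_entry_name(b"lib/net45/Foo.dll", 0)
```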

What kinds of validations does nuget perform when packages are uploaded? Is that code open-sourced somewhere?

We have two categories of validations. Synchronous (completed during nuget.exe push PUT request) and asynchronous (completed shortly after push and prior to the package becoming available).

Synchronous validations are driven by NuGetGallery. The flow starts here: https://github.com/NuGet/NuGetGallery/blob/a79b0f0460a110c4b6fb6edae95ff4bf7fe11263/src/NuGetGallery/Controllers/ApiController.cs#L535

Asynchronous validations are driven by our validation orchestrator. README here: https://github.com/NuGet/NuGet.Jobs/blob/master/src/NuGet.Services.Validation.Orchestrator/README.md

Given that most packages are probably created via dotnet pack I imagine that they are pretty consistent and that odd-ball, hand-crafted, zip files are rare.

I think in general you are correct. Our pack tooling has improved a lot over the years, so most scenarios are supported, leading to more consistent ZIP shapes. As a lower bound on hand-crafted packages, I did a query in my data set (thanks again for the contribution @loic-sharma!) and it looks like 0.05% of packages don't have a [Content_Types].xml file, which is an OPC marker. If this file is not present, we can confidently say the package was not produced using official tooling. But that's a lower bound and doesn't account for any post-processing done on a dotnet pack/nuget.exe pack-produced .nupkg that could make things "weird". Another question is whether we care about this small percentage (>= 0.05%). If you weight by download count it seems like even less of a problem:

[image]

So, definitely good enough for POC I think 😃.

Another angle I want to bring up is that there are a LOT of native assemblies in lib/ or build/ that will fit into FuGetGallery's criteria for this flow (https://github.com/praeclarum/FuGetGallery/blob/037f80cc334d66847c57b637b2c135678e737c38/Data/PackageData.cs#L125-L126). These assemblies also skew larger. In other words, a good number of the more expensive range requests will produce no valuable output from the Mono.Cecil decompilation. Here are some percentiles of assembly sizes, grouped by managed vs. unmanaged assemblies:

[image]

The point I'm trying to get at is that even with range requests you don't know in advance whether the thing you're downloading is even useful for the FuGet web UI. So perhaps the increased number of round trips, plus the chance of getting an uninteresting assembly, will change the picture or the set of solution hypotheses for the OOM problem. If you have the full .nupkg on disk or in memory, you can perhaps do clever and fast seeking on the .dll to determine whether it should be analyzed.

I wanted to bring this to your attention to consider, but I don't know if it changes the value proposition here.

One big-picture option beyond the POC is to write a backend job that follows the NuGet.org catalog, captures all of the metadata needed for the web UI, and puts it in a persistent store (e.g. Azure Blob Storage). Then the FuGet web UI can just read this data without concern for OOM. Fundamentally this is the design that NuGet.org uses for all of the api.nuget.org assets, as well as my ExplorePackages project. Of course this is a bigger change in a completely different direction. Maybe I should post on the OOM issue about this 🤔.

MarkPflug commented 3 years ago

Thanks Joel, this is very helpful.

Some packages have tens of thousands of files so dealing with big central directories is a legitimate issue

Do you happen to know of some examples you can share? It would be interesting to test the behavior of some of these. I can't imagine it is any worse than downloading the entire archive. I'd be curious to see how the UI deals with it though.

there are a LOT of native assemblies in lib/ or build/

I don't know whether that will be a problem. I've tried to modify the code paths that were accessing zip entries eagerly to instead access them only when needed. In the handful of packages I tested, this resulted in only one dll and one xml file being requested during the initial page load. Once caching (file or memory) is added to those requests, I think things should be in a pretty good place. I suspect that a lot of these native binaries will never be requested, and even if they are, we should still be in a better place than we are today. If you have examples of packages that fit these criteria, that would be helpful.

I think you're absolutely right about some more extensive changes that could help with these issues. The immutable nature of nuget packages means that all of this data is 100% cacheable. No need to even hold onto entire assembly images when the caching could be applied to the output of the Cecil processing, or to the generated HTML. Definitely lots of directions to explore, but for now I think this zip hack will show the biggest bang for the buck (keystroke).

joelverhagen commented 3 years ago

Examples:

- aspose.pub.cpp 20.9.0 - biggest unmanaged assembly
- microsoft.codecoverage 15.9.0 - most popular unmanaged assembly
- twilio.voice.ios.xamarinbinding 5.0.0 - biggest managed assembly
- stocksharp.ecng.xaml.dxrefs 1.0.1 - package with the most managed assemblies (959)
- videolan.libvlc.uwp 3.3.1 - package with the most unmanaged assemblies (938)
- leadtools.redist.19 19.0.0 - package with the most of both (556 managed, 85 unmanaged)

Percentiles of assemblies per package:

[image]

MarkPflug commented 3 years ago

Thanks for compiling a list of packages that people can use to DoS fuget, lol. Jokes aside, having access to this dataset is incredible.

aspose.pub.cpp 20.9.0 - biggest unmanaged assembly

This one seems fine. Only the nuspec gets downloaded.

microsoft.codecoverage 15.9.0 - most popular unmanaged assembly

No problem. Nuspec only.

twilio.voice.ios.xamarinbinding 5.0.0 - biggest managed assembly

This one is very problematic. The package includes a managed lib assembly that contains a massive embedded resource, which appears to be a Java class file (0xCAFEBABE). My feeling is to just put a limit on the size of dll that is acceptable to download. That limit could be pretty permissive, say 10MB. We know both the compressed and decompressed sizes beforehand from reading the zip directory info, so it would be very easy to skip accessing entries that we know will be overlarge. Indeed, one could imagine a maliciously constructed NuGet package containing a "zip bomb": an entry that is small in compressed size but unreasonable to decompress. I don't know whether such a thing could make it past NuGet's validations, but it's worth considering. This particular package feels abusive, though, and I think packages like this are going to cause continued stability problems for fuget. I think it would be better to just punt and display a friendly error for packages of this ilk.
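Since the central directory already carries both sizes, the guard is a pure metadata check. A sketch with hypothetical thresholds (the 10MB cap from above, plus an assumed compression-ratio limit against zip bombs):

```python
# Hypothetical thresholds: only central-directory metadata is consulted,
# so no bytes of the entry are downloaded to make the decision.
MAX_UNCOMPRESSED = 10 * 1024 * 1024  # the permissive 10MB cap suggested above
MAX_RATIO = 100                      # assumed guard against "zip bomb" entries

def should_fetch_entry(compressed_size, uncompressed_size):
    if uncompressed_size > MAX_UNCOMPRESSED:
        return False
    if compressed_size > 0 and uncompressed_size > compressed_size * MAX_RATIO:
        return False
    return True

ok = should_fetch_entry(400_000, 1_200_000)           # typical managed dll
too_big = should_fetch_entry(30_000_000, 90_000_000)  # oversized assembly
bomb = should_fetch_entry(10_000, 9_000_000)          # suspicious ratio
```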

stocksharp.ecng.xaml.dxrefs 1.0.1 - package with the most managed assemblies (959)

I couldn't find this package, but there is an "ecng.xaml.dxrefs" that is probably the same thing. This is problematic but easily fixable. The issue is that the razor layout renders a "title" for the dll link in the left nav; that title contains the name/version of the assembly, which requires loading the assembly. I removed the title from the link as a "fix", at which point only the nuspec and a single assembly are accessed. This package is also interesting because it includes a lot of localized resource dlls. It might be worth not rendering those in the listing at all, or collapsing them to show only the language folder, since everything underneath isn't that interesting.
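The collapsing idea could work off the entry paths alone. A hypothetical sketch (Python illustration; the `lib/<tfm>/<culture>/X.resources.dll` pattern is an assumption about how satellite assemblies are laid out):

```python
import re

# Hypothetical matcher for satellite assemblies: lib/<tfm>/<culture>/X.resources.dll
CULTURE_DLL = re.compile(
    r"^lib/[^/]+/([a-zA-Z]{2}(?:-[a-zA-Z0-9]{2,8})*)/[^/]+\.resources\.dll$"
)

def collapse_satellites(paths):
    """Split entries into 'real' assemblies and the set of culture folders,
    so the nav could show one collapsed node per language."""
    keep, cultures = [], set()
    for p in paths:
        m = CULTURE_DLL.match(p)
        if m:
            cultures.add(m.group(1))
        else:
            keep.append(p)
    return keep, sorted(cultures)

keep, cultures = collapse_satellites([
    "lib/net45/Foo.dll",
    "lib/net45/fr/Foo.resources.dll",
    "lib/net45/zh-Hans/Foo.resources.dll",
])
```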

videolan.libvlc.uwp 3.3.1 - package with the most unmanaged assemblies (938)

No problem, nuspec only.

leadtools.redist.19 19.0.0 - package with the most of both (556 managed, 85 unmanaged)

I couldn't find this package. Maybe it's unlisted? I see other "leadtools" packages, but not this one.

praeclarum commented 3 years ago

I like this a lot, thank you @MarkPflug - I can see it saving a lot of memory and download time.

@joelverhagen Do you have a recommendation on what to change my user agent to, to avoid triggering your download counter? I've never wanted the FuGet hits to count as downloads, so I'm happy to set whatever headers prevent that.

MarkPflug commented 3 years ago

One of the known issues is that this code doesn't handle zip archives that have comments. I've never seen a zip archive with a comment, so I suspect it is unlikely to affect many packages. The standard tooling won't create a package with a comment; the ZipArchive class in the framework doesn't support creating one. However, SharpZipLib does support comments, so it's possible that some adventurous developer has published a funky nupkg that contains one.

There are a couple of TODOs in the code. One has to do with reading filenames using ASCII encoding; I'm not sure that's right. The other seems to relate to resolving the assembly name from the documentation xml. I don't remember the context on that one, but the way it works seems reasonable.

Finally, I think I introduced a bug that needs to be fixed:

1. Navigate to my experimental instance: https://pfluget-test.azurewebsites.net/packages/Newtonsoft.Json
2. Click "Newtonsoft.Json" under the Namespace listing in the left nav.
3. Notice the link is broken: https://pfluget-test.azurewebsites.net/packages/Newtonsoft.Json/13.0.1/lib/netstandard2.0/lib%2Fnetstandard2.0%2FNewtonsoft.Json.dll/Newtonsoft.Json

It appears there might be a double-encoding happening somehow, but it isn't immediately clear to me where that's happening.
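One plausible mechanism, sketched in Python rather than the actual razor/C# code, is that the full entry path gets escaped as a single URL segment somewhere, which turns its slashes into `%2F`, matching the shape of the broken link. This is a guess at the cause, not a diagnosis:

```python
from urllib.parse import quote

entry = "lib/netstandard2.0/Newtonsoft.Json.dll"

# Escaping the whole entry path as one URL segment encodes the slashes,
# reproducing the %2F runs seen in the broken link.
one_segment = quote(entry, safe="")

# Escaping each path component separately keeps the slashes as separators.
per_segment = "/".join(quote(part) for part in entry.split("/"))
```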

joelverhagen commented 3 years ago

@praeclarum, unfortunately, this is an area that's not documented at all. For NuGet.org's own data analysis project (NuGet.Insights) I use the following approach: prefix the user agent with Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; AppInsights)

It's a dirty hack, I know, but it's the best we have today. Please also add some identifying information after this prefix so that we (the NuGet.org team) can associate these downloads with FuGet later if necessary. As long as you have that prefix, your data will be dropped from the counting, but the additional information can be helpful for investigations.

We plan to have a more elegant approach in the future (some more descriptive "ignore me" suffix or substring) but it's not defined today.

My user agent for NuGet.Insights is something like this: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; AppInsights) NuGet Test Client/6.0.0 (Microsoft Windows 10.0.19043) NuGet.Insights.Logic.Test.Bot/1.0.0 (X64; X64; .NET Core 3.1.15; +https://github.com/NuGet/Insights)
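Putting the pieces together, a FuGet user agent could look like this. The prefix string is verbatim from above; the `FuGetGallery` identifier and URL suffix are hypothetical examples of the "identifying information" to append:

```python
# The exclusion prefix quoted above, verbatim, plus a hypothetical FuGet
# identifier appended so the NuGet.org team can associate the traffic later.
EXCLUSION_PREFIX = (
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; AppInsights)"
)

def build_user_agent(client, version, url):
    return f"{EXCLUSION_PREFIX} {client}/{version} (+{url})"

ua = build_user_agent("FuGetGallery", "1.0.0", "https://www.fuget.org")
```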

joelverhagen commented 3 years ago

@MarkPflug, I think the comment case is unimportant. There appear to be only 7 such packages on NuGet.org, and their download counts are low:

[image]

praeclarum commented 3 years ago

Thanks for the quick reply @joelverhagen! I'll get those changes in.