rrybarczyk opened this issue 4 years ago
The stack trace is profiled using flamegraph for both the `download` and `find-bottleneck` commands on rrc01, 02, 03, 14, 17 (which has an invalid gzip header), and 18. A flamegraph tells us which functions are on the stack most often. The results are committed to the profile branch in the `flmg` directory, and the commands I used are added to the justfile so they can easily be run again.
Regarding `flmg/fb-1-2-3-14-17-18.svg`:
As expected, we see that `asmap_rs::find_bottleneck::FindBottleneck::locate` and its accumulated child calls make up the majority of the call chain. The most common descendants on the stack are (the percentages do not, and should not, sum to 100%, since children contribute to their parents' percentages):
```
asmap_rs::find_bottleneck::FindBottleneck::locate, 97.10%
├── asmap_rs::find_bottleneck::FindBottleneck::match_rib_entry, 32.82%
│   └── asmap_rs::as_path_parser::AsPathParser::parse, 20.72%
└── mrt_rs::Reader<T>::read, 45.24%
    └── mrt_rs::records::tabledump::RIBEntry::parse, 39.93%
        └── inflate (from flate2's GzDecoder, called inside the mrt_rs crate), 17.97%
```
Conclusion: these are the areas we should look into optimizing.
Thoughts and proposed steps forward:
- The `mrt_rs`-related calls are in another crate; we cannot improve this behavior unless we rip out this dependency and replace it with our own parser that skips over parsing anything other than the IP/mask and their associated AS paths (which I am TOTALLY game for).
- `mrt_rs` makes a call to `flate2`'s `inflate` function to decompress the gz files. I was surprised by how much time is spent there. No matter what, we need this `inflate` call even if we don't use `mrt-rs`. What about moving the decompression functionality into the `download` command, in an attempt to spread out the load that the user sees per command?
- If we rip out `mrt-rs` (which I think we should), `AsPathParser` will be absorbed into the new logic that we write. In terms of making `AsPathParser` faster, do you have any ideas? I think my implementation is a reasonable way to do it, but I am open to any suggestions that would noticeably increase efficiency here. I will continue to look into it as well.

Would love to hear your thoughts and get feedback, @naumenkogs.
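If we do write our own parser, the cheap-skip structure falls out of the MRT format itself: every record starts with a 12-byte common header (RFC 6396) carrying timestamp, type, subtype, and body length, so anything we don't care about can be skipped by length without parsing it. A minimal std-only sketch of that idea; the names (`MrtHeader`, `read_header`, `skip_body`) are hypothetical, not taken from asmap-rs or mrt_rs:

```rust
use std::io::{self, Read};

/// MRT common header (RFC 6396): timestamp, type, subtype, body length.
#[derive(Debug, PartialEq)]
pub struct MrtHeader {
    pub timestamp: u32,
    pub mrt_type: u16,
    pub subtype: u16,
    pub length: u32,
}

/// Read one 12-byte MRT common header; returns Ok(None) on a clean EOF
/// (the stream ended exactly on a record boundary).
pub fn read_header<R: Read>(r: &mut R) -> io::Result<Option<MrtHeader>> {
    let mut buf = [0u8; 12];
    match r.read_exact(&mut buf) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    Ok(Some(MrtHeader {
        timestamp: u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]),
        mrt_type: u16::from_be_bytes([buf[4], buf[5]]),
        subtype: u16::from_be_bytes([buf[6], buf[7]]),
        length: u32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]),
    }))
}

/// Skip a record body without buffering it: records we don't care about
/// cost only this copy-to-sink.
pub fn skip_body<R: Read>(r: &mut R, len: u32) -> io::Result<()> {
    io::copy(&mut r.take(len as u64), &mut io::sink()).map(|_| ())
}
```

The caller would loop `read_header` / `skip_body`, only descending into bodies whose type/subtype marks a TABLE_DUMP_V2 RIB entry.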
Did the compiler optimize the code for these measurements?
Would be great to add these results to some README file with explanations. I think this is good enough. I think we should still focus on RAM for now, though.
Making `AsPathParser` faster seems to be a waste of time now, because the actual algorithm takes very little of the run time. It's crazy how much mrt-rs takes, and the second-biggest user is `AsPathParser::parse`, so, just reading BGP records... Maybe we can do some optimizations in this method. I'll take a look sometime.
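For concreteness on what `AsPathParser::parse` is up against: BGP's AS_PATH attribute is a list of segments, each a one-byte segment type, a one-byte ASN count, then that many big-endian AS numbers. A hedged, std-only sketch of parsing one segment, assuming 4-byte AS numbers; `parse_as_segment` is a hypothetical name, not the asmap-rs implementation:

```rust
/// Parse one AS_PATH segment: a 1-byte segment type (1 = AS_SET,
/// 2 = AS_SEQUENCE), a 1-byte ASN count, then `count` big-endian
/// 32-bit AS numbers. Returns None if the buffer is too short.
pub fn parse_as_segment(buf: &[u8]) -> Option<(u8, Vec<u32>)> {
    let (&seg_type, rest) = buf.split_first()?;
    let (&count, rest) = rest.split_first()?;
    let need = count as usize * 4;
    if rest.len() < need {
        return None;
    }
    let asns = rest[..need]
        .chunks_exact(4)
        .map(|c| u32::from_be_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Some((seg_type, asns))
}
```

Since this is straight-line byte slicing with a single allocation per segment, there is not much fat to trim; reusing one `Vec` across segments would be the obvious micro-optimization.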
> Did the compiler optimize the code for these measurements?
>
> Would be great to add these results to some README file with explanations. I think this is good enough. I think we should still focus on RAM for now, though.

Yes, it is a release build, because flamegraph defaults to release.
> Making `AsPathParser` faster seems to be a waste of time now, because the actual algorithm takes very little of the run time. It's crazy how much mrt-rs takes, and the second-biggest user is `AsPathParser::parse`, so, just reading BGP records... Maybe we can do some optimizations in this method. I'll take a look sometime.

Can you elaborate on these two sentences? They seem to conflict: the first says `AsPathParser` is a waste of time to optimize, but the second acknowledges that we should optimize here.
I have been thinking about the right path forward when it comes to optimization...
When it comes to what we are optimizing for, we said that we are trying to get the RAM usage lower. However, this is still vague on my end. I want to make sure we are optimizing for a specific case. There are many, many possibilities for what systems end users will be running this program on. The systems could be anything: a 2gb machine vs a 4gb machine vs a 4gb plus tons of swap, etc.
Right now, it seems like there are other things that may be more valuable, like writing more tests for correctness and getting people to use it. Once people use it, we can figure out what (if any) problems they are encountering and then optimize for those specific issues.
For immediate actions, I think we should write more tests to ensure correctness. This way we ensure we are not introducing any bugs as we optimize.
That being said, I want to inflate the gz files in the `download` command because it is a simple "optimization" that will yield a faster runtime on any system.
In terms of loading the data in chunks to reduce RAM usage: it might not be the right optimization to try first, but I am still curious to see the difference in the resulting flamegraphs, so we can still go for it.
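On loading the data in chunks: the point is that peak memory is then bounded by the buffer size rather than the file size. A minimal std-only sketch of the pattern, with the real per-record parsing stubbed out as a byte count; `process_chunked` is a hypothetical name, not asmap-rs code:

```rust
use std::io::{self, Read};

/// Stream a reader through a fixed-size buffer so peak memory stays
/// bounded by `chunk` bytes, no matter how large the input is.
/// Returns the total number of bytes seen (a stand-in for real work).
pub fn process_chunked<R: Read>(mut input: R, chunk: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; chunk];
    let mut total = 0u64;
    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        // Real code would feed buf[..n] into the record parser here,
        // carrying over any partial record between iterations.
        total += n as u64;
    }
    Ok(total)
}
```

The tricky part in practice is the carry-over of a record split across two chunks, which is part of why this may or may not be the right thing to try first; the flamegraph comparison should tell.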
> Can you elaborate on these two sentences? They seem to conflict: the first says `AsPathParser` is a waste of time to optimize, but the second acknowledges that we should optimize here.

I said "algorithm", meaning all the logical loops (as opposed to the `parse` method, which just converts strings into numbers). The latter may be worth optimizing; the former is too small a share of the overall run time to bother with.
> When it comes to what we are optimizing for, we said that we are trying to get the RAM usage lower. However, this is still vague on my end. I want to make sure we are optimizing for a specific case. There are many, many possibilities for what systems end users will be running this program on. The systems could be anything: a 2gb machine vs a 4gb machine vs a 4gb plus tons of swap, etc.

I would say we are optimizing for reducing runtime. Use of swap is bad because it makes everything much slower; that is why we should keep memory within RAM limits. I would target the RAM required to compile Bitcoin Core, which is around 2-4 GB. I think this is a reasonable requirement for our paradigm; it covers any reasonable laptop from 2010 onward. My intuition is that other optimizations won't be as impactful as keeping memory usage within RAM (you can check this by running it on a 16 GB RAM machine).

You can obviously ignore my suggestion and optimize other things.
> Like writing more tests for correctness

Yes.
> Once people use it, we can figure out what (if any) problems they are encountering and then optimize for those specific issues.

I wouldn't rely on this. I would be happy if 1,000 people independently use this during 2020 (although a pre-made asmap created by a Bitcoin Core maintainer with this program will be shipped with the release, so that's cool). And of those 1,000, I'd expect maybe 10 reports or something. Most people will simply give up if they face any difficulties.
> That being said, I want to inflate the gz files in the `download` command because it is a simple "optimization" that will yield a faster runtime on any system.

I'm not sure that works out, but yeah, sounds good if true.
> I said "algorithm", meaning all the logical loops (as opposed to the `parse` method, which just converts strings into numbers). The latter may be worth optimizing; the former is too small a share of the overall run time to bother with.

I think there may still be some miscommunication here, because `AsPathParser` is not converting strings to numbers anywhere, BUT I don't think it really matters for this conversation anyway :)
Everything sounds good to me.
Alright, so I was testing the script over all 25 files on this google cloud machine: n1-standard-2 (2 vCPUs, 7.5 GB memory, added 10Gb swap). I think it's reasonable to expect a smooth workflow on this kind of hardware.
After more than 6 hours, your script had eaten all the memory, so I couldn't ssh back into the machine. Maybe it would have terminated eventually, but since I couldn't track the progress, I gave up. This is the RAM issue I was talking about.
My code, on the other hand, successfully finished after 200 minutes.
Currently asmap-rs takes 15 GB of RAM; we need to get that down to 4 GB.
Rationale from @naumenkogs: "Ideally, any user should have the right to be paranoid and do everything on their own, including constructing their own asmap locally. Obviously, requiring constructing asmaps on the phone is not reasonable, but maybe comparing to compiling Bitcoin Core is reasonable. This takes 2-4Gb of RAM."
For profiling and narrowing down the inefficient code:
- `benchmark` feature for tests
- `cargo flamegraph` (requires a bin to work)