rrybarczyk opened this issue 4 years ago
The stack trace is profiled using flamegraph for both the `download` and `find-bottleneck` commands on rrc01, 02, 03, 14, 17 (which has an invalid gzip header), and 18. A flamegraph tells us which functions are on the stack most often. The results are committed to the profile branch in the `flmg` directory, and the commands I used are added to the justfile so they can easily be run again.
Regarding `flmg/fb-1-2-3-14-17-18.svg`:
As expected, we see that `asmap_rs::find_bottleneck::FindBottleneck::locate` and its accumulated child calls make up the majority of the call chain. The most common descendants on the stack are (the percentages do not, and should not, sum to 100%, since children contribute to their parents' percentages):
```
asmap_rs::find_bottleneck::FindBottleneck::locate, 97.10%
├── asmap_rs::find_bottleneck::FindBottleneck::match_rib_entry, 32.82%
│   └── asmap_rs::as_path_parser::AsPathParser::parse, 20.72%
└── mrt_rs::Reader<T>::read, 45.24%
    └── mrt_rs::records::tabledump::RIBEntry::parse, 39.93%
        └── inflate (from flate2's GzDecoder, called inside the mrt_rs crate), 17.97%
```
Conclusion: these are the areas we should look into optimizing.
Thoughts and proposed steps forward:
- The `mrt_rs`-related calls are in another crate; we cannot improve this behavior unless we rip out this dependency and replace it with our own parser that skips over parsing anything other than the IP/mask and their associated AS paths (which I am TOTALLY game for).
- `mrt_rs` makes a call to `flate2`'s `inflate` function to decompress the gz files. I was surprised by how much time is spent there. No matter what, we need this `inflate` call even if we don't use `mrt-rs`. What about moving the decompression functionality into the `download` command, in an attempt to spread out the load that the user sees per command?
- If we rip out `mrt-rs` (which I think we should), `AsPathParser` will be absorbed into the new logic that we write. In terms of making `AsPathParser` faster, do you have any ideas? I think my implementation is a reasonable way to do it, but I am open to any suggestions that would noticeably increase efficiency here. I will continue to look into it as well.

Would love to hear your thoughts and get feedback, @naumenkogs.
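If we do write our own parser, the cheap-skip structure falls out of the MRT format itself: every record starts with a 12-byte common header (RFC 6396) carrying timestamp, type, subtype, and body length, so anything we don't care about can be skipped by length without parsing it. A minimal std-only sketch of that idea; the names (`MrtHeader`, `read_header`, `skip_body`) are hypothetical, not taken from asmap-rs or mrt_rs:

```rust
use std::io::{self, Read};

/// MRT common header (RFC 6396): timestamp, type, subtype, body length.
#[derive(Debug, PartialEq)]
pub struct MrtHeader {
    pub timestamp: u32,
    pub mrt_type: u16,
    pub subtype: u16,
    pub length: u32,
}

/// Read one 12-byte MRT common header; returns Ok(None) on a clean EOF
/// (the stream ended exactly on a record boundary).
pub fn read_header<R: Read>(r: &mut R) -> io::Result<Option<MrtHeader>> {
    let mut buf = [0u8; 12];
    match r.read_exact(&mut buf) {
        Ok(()) => {}
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    Ok(Some(MrtHeader {
        timestamp: u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]),
        mrt_type: u16::from_be_bytes([buf[4], buf[5]]),
        subtype: u16::from_be_bytes([buf[6], buf[7]]),
        length: u32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]),
    }))
}

/// Skip a record body without buffering it: records we don't care about
/// cost only this copy-to-sink.
pub fn skip_body<R: Read>(r: &mut R, len: u32) -> io::Result<()> {
    io::copy(&mut r.take(len as u64), &mut io::sink()).map(|_| ())
}
```

The caller would loop `read_header` / `skip_body`, only descending into bodies whose type/subtype marks a TABLE_DUMP_V2 RIB entry.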
Did the compiler optimize the code for these measurements?
Would be great to add these results to some README file with explanations. I think this is good enough. I think we should still focus on RAM for now, though.
Making `AsPathParser` faster seems to be a waste of time now, because the actual algorithm takes very little of the run time. It's crazy how much mrt-rs takes, and the second-biggest user is `AsPathParser::parse`, so, just reading BGP records... Maybe we can do some optimizations in this method. I'll take a look sometime.
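For concreteness on what `AsPathParser::parse` is up against: BGP's AS_PATH attribute is a list of segments, each a one-byte segment type, a one-byte ASN count, then that many big-endian AS numbers. A hedged, std-only sketch of parsing one segment, assuming 4-byte AS numbers; `parse_as_segment` is a hypothetical name, not the asmap-rs implementation:

```rust
/// Parse one AS_PATH segment: a 1-byte segment type (1 = AS_SET,
/// 2 = AS_SEQUENCE), a 1-byte ASN count, then `count` big-endian
/// 32-bit AS numbers. Returns None if the buffer is too short.
pub fn parse_as_segment(buf: &[u8]) -> Option<(u8, Vec<u32>)> {
    let (&seg_type, rest) = buf.split_first()?;
    let (&count, rest) = rest.split_first()?;
    let need = count as usize * 4;
    if rest.len() < need {
        return None;
    }
    let asns = rest[..need]
        .chunks_exact(4)
        .map(|c| u32::from_be_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Some((seg_type, asns))
}
```

Since this is straight-line byte slicing with a single allocation per segment, there is not much fat to trim; reusing one `Vec` across segments would be the obvious micro-optimization.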
> Did the compiler optimize the code for these measurements?
>
> Would be great to add these results to some README file with explanations. I think this is good enough. I think we should still focus on RAM for now, though.

Yes, it is a release build, because flamegraph defaults to release.
> Making `AsPathParser` faster seems to be a waste of time now, because the actual algorithm takes very little of the run time. It's crazy how much mrt-rs takes, and the second-biggest user is `AsPathParser::parse`, so, just reading BGP records... Maybe we can do some optimizations in this method. I'll take a look sometime.

Can you elaborate on these two sentences? They seem to conflict: the first says `AsPathParser` is a waste of time to optimize, but the second acknowledges that we should optimize here.
I have been thinking about the right path forward when it comes to optimization...
When it comes to what we are optimizing for, we said that we are trying to get the RAM usage lower. However, this is still vague on my end. I want to make sure we are optimizing for a specific case. There are many, many possibilities for what systems end users will be running this program on. The systems could be anything: a 2gb machine vs a 4gb machine vs a 4gb plus tons of swap, etc.
Right now, it seems like there are other things that may be more valuable, like writing more tests for correctness and getting people to use it. Once people use it, we can figure out what (if any) problems they are encountering and then optimize for those specific issues.
For immediate actions, I think we should write more tests to ensure correctness. This way we ensure we are not introducing any bugs as we optimize.
That being said, I want to inflate the gz files in the `download` command because it is a simple "optimization" that will yield a faster runtime on any system.
In terms of loading the data in chunks to reduce RAM usage: it might not be the right optimization to try first, but I am still curious to see the difference in the resulting flamegraphs, so we can still go for it.
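On loading the data in chunks: the point is that peak memory is then bounded by the buffer size rather than the file size. A minimal std-only sketch of the pattern, with the real per-record parsing stubbed out as a byte count; `process_chunked` is a hypothetical name, not asmap-rs code:

```rust
use std::io::{self, Read};

/// Stream a reader through a fixed-size buffer so peak memory stays
/// bounded by `chunk` bytes, no matter how large the input is.
/// Returns the total number of bytes seen (a stand-in for real work).
pub fn process_chunked<R: Read>(mut input: R, chunk: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; chunk];
    let mut total = 0u64;
    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        // Real code would feed buf[..n] into the record parser here,
        // carrying over any partial record between iterations.
        total += n as u64;
    }
    Ok(total)
}
```

The tricky part in practice is the carry-over of a record split across two chunks, which is part of why this may or may not be the right thing to try first; the flamegraph comparison should tell.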
> Can you elaborate on these two sentences? They seem to conflict: the first says `AsPathParser` is a waste of time to optimize, but the second acknowledges that we should optimize here.

I said "algorithm", meaning all the logical loops (as opposed to the `parse` method, which just converts strings into numbers). The latter may be worth optimizing; the former is too small a share of the overall run time to bother with.
> When it comes to what we are optimizing for, we said that we are trying to get the RAM usage lower. However, this is still vague on my end. I want to make sure we are optimizing for a specific case. There are many, many possibilities for what systems end users will be running this program on. The systems could be anything: a 2gb machine vs a 4gb machine vs a 4gb plus tons of swap, etc.

I would say we are optimizing for reducing runtime. Use of swap is bad because it makes everything much slower; that is why we should keep memory within RAM limits. I would target the RAM required to compile Bitcoin Core, which is around 2-4 GB. I think this is a reasonable requirement for our paradigm; it covers any reasonable laptop from 2010 onward. My intuition is that other optimizations won't be as impactful as keeping memory usage within RAM (you can check this by running it on a 16 GB RAM machine).

You can obviously ignore my suggestion and optimize other things.
> Like writing more tests for correctness

Yes.
> Once people use it, we can figure out what (if any) problems they are encountering and then optimize for those specific issues.

I wouldn't rely on this. I would be happy if 1,000 people independently use this during 2020 (although a pre-made asmap created by a Bitcoin Core maintainer with this program will be shipped with the release, so that's cool). And of those 1,000, I'd expect maybe 10 reports or something. Most people will simply give up if they face any difficulties.
> That being said, I want to inflate the gz files in the `download` command because it is a simple "optimization" that will yield a faster runtime on any system.

I'm not sure that works out, but yeah, sounds good if true.
> I said "algorithm", meaning all the logical loops (as opposed to the `parse` method, which just converts strings into numbers). The latter may be worth optimizing; the former is too small a share of the overall run time to bother with.

I think there may still be some miscommunication here, because `AsPathParser` is not converting strings to numbers anywhere, BUT I don't think it really matters for this conversation anyway :)
Everything sounds good to me.
Alright, so I was testing the script over all 25 files on this google cloud machine: n1-standard-2 (2 vCPUs, 7.5 GB memory, added 10Gb swap). I think it's reasonable to expect a smooth workflow on this kind of hardware.
After more than 6 hours, your script had eaten all the memory, so I couldn't ssh back into the machine. Maybe it would have terminated eventually, but since I couldn't track the progress, I gave up. This is the RAM issue I was talking about.
My code, on the other hand, successfully finished after 200 minutes.
Currently asmap-rs takes 15 GB of RAM; we need to get that down to 4 GB.
Rationale from @naumenkogs: "Ideally, any user should have the right to be paranoid and do everything on their own, including constructing their own asmap locally. Obviously, requiring constructing asmaps on the phone is not reasonable, but maybe comparing to compiling Bitcoin Core is reasonable. This takes 2-4Gb of RAM."
For profiling and narrowing down the inefficient code:
- `benchmark` feature for tests
- `cargo flamegraph` (requires a bin to work)