Closed airalcorn2 closed 1 year ago
The issue seems to be that I was reading the data from an external hard drive. The first time a file is read from the external hard drive is extremely slow. However, the next time time the file is read (as long as the external hard drive hasn't been unplugged), the file is processed quickly as expected, so the file seems to be cached somehow. I didn't realize this was the case because I had been playing with the data for a while without ever unplugging the hard drive. I'm going to leave this open for the time being and verify that moving all the data locally results in all of the files being processed quickly.
@airalcorn2 Thank you for the report. I attempted to reproduce this issue using two large (~5gb) pcap files, enumerating all scans from each in an infinite loop. However, I observe a more or less constant timing between runs.
OS-1-128_v3.0.1_2048x10_20230216_143245
49: 0.56 s
99: 0.56 s
...
1049: 0.56 s
1099: 0.56 s
OS-0-128_v3.0.1_2048x10_20230216_173241
49: 0.57 s
99: 0.56 s
...
1249: 0.56 s
1299: 0.56 s
(Similar output occurs over successive runs for these two files.)
Can you tell me a little more about your setup? In particular, are you using an HDD or an SSD for storage? In either case, what is the external drive's interface (e.g. USB 2.0, 3.0, or other?)
I confirmed that first moving the files locally does not result in any slowdowns when processing the files. However, the slowdown is 100% repeatable when reading the data from the external hard drive, which is an SSD connected with USB 2.0. If I unplug the hard drive, plug it back in, and run the script: I get this output:
OS-1-128-122213000577-2048x10_WAT0-D-N-FR
49: 4.67 s
99: 3.02 s
149: 4.48 s
199: 3.03 s
249: 3.06 s
299: 4.53 s
I hit CTRL-C after 299. When I immediately re-run the script, this is what I get:
OS-1-128-122213000577-2048x10_WAT0-D-N-FR
49: 0.27 s
99: 0.26 s
149: 0.25 s
199: 0.25 s
249: 0.25 s
299: 0.25 s
349: 1.16 s
399: 4.48 s
449: 3.04 s
You can see that all of the frames before I hit CTRL-C are iterated over very quickly, but then it immediately slows down, so the data must be getting cached between runs.
I have a set of PCAP files that I'm trying to process. While iterating over the
Scans
objects for the first few files is very fast, shortly after that, iterating over aScans
object suddenly becomes extremely slow. It often happens in the middle of iterating over theScans
object for a file, and then processing all subsequent PCAPs is similarly slow. All of the PCAPs are of a similar size. Further, if I process a PCAP that was slow in the script separately on its own, it's fast. The timings here were produced by the script below.