Analyzer: Add time_bin argument to area_to_pandas

EwoutH commented 2 months ago

This PR introduces time-binned analysis capabilities to the link_to_pandas and area_to_pandas methods in the Analyzer class, while also addressing an issue with traffic volume calculation. These changes allow for more flexible, granular, and accurate analysis of traffic data over time.

Changes

Added time_bin argument to link_to_pandas:
- When time_bin is None, the method behaves as before.
- When time_bin is specified, data is aggregated into time bins.
- Performance is improved for time-binned analysis due to reduced iterations.
Updated area_to_pandas to work with time-binned data:
- Now uses the time-binned data from link_to_pandas when time_bin is specified.
- Calculates area statistics for each time bin.
Fixed traffic volume calculation in area_to_pandas:
- Now correctly counts vehicles entering the area from outside.
- Also counts vehicles starting their trip within the area.
- Avoids double-counting vehicles passing through multiple links within the area.
Data validation:
- traffic_volume now represents the volume per bin instead of cumulative.
- total_travel_time is calculated per bin, which is more logical for a binned approach.

Notes

The test suite has been updated to reflect the changes in traffic volume calculation.
The new approach provides a more accurate representation of traffic patterns over time.

Part of #120.

TODO

[x] Validate data
[x] Add argument to link_to_pandas
[x] Fix traffic volume calculation

EwoutH commented 2 months ago

Okay, it finally starts to look like something.

1. Added `time_bin` argument to `link_to_pandas`

Added the optional argument time_bin argument to link_to_pandas. Default behavior is now restored, and it's a bit faster since it only need to iterate over a lower amount of slices instead of over all individual timesteps.

2. Validated the data.

Previous strategy, with the timestep as in #123 and #128, it looked like this:

With this PR it looks like this:

Note that:

traffic_volume looks like it's higher. This could be a behavior change, since
total_travel_time is now per bin instead of cumulative, which seems more logical when going for a binned approach.

3. Tests fail

The test fails because we now set "area" as the index name (from the inputted area_names). We can either change that or update the tests, since I find "area" quite an logical index.

EwoutH commented 2 months ago

@toruseo This is now ready for an initial review, before I optimize further.

toruseo commented 2 months ago

Thanks for your hard work.

This time_bin approach is very cool! Let's go this way.

The test fails because we now set "area" as the index name (from the inputted area_names). We can either change that or update the tests, since I find "area" quite an logical index.

For this issue, I want to ensure the backward compatibility. Maybe the index name for non time_bin mode can be just an integer (i.e., drop line 1519).

Another related issue is that the index for area_to_pandas is different from the others. This might be confusing. Is it possible to make these (time_bin, area) indexing a optional mode to be enabled by optional argument? Edit: At least, the internal variable Analyzer.df_area should have a consistent data structure. I think something like below would be suitable

s.df_area = pd.DataFrame(result)
if index_by_area:
    if time_bin is not None:
        return s.df_area.set_index(["time_bin", "area"])
    else:
        return s.df_area.set_index(["area"])

Similar indexing mode could be added to link_to_pandas as well.

As a minor improvement, it is better to accept any numbers (including float) as time_bin argument as it is in seconds. Just doing time_bin=int(time_bin) in the beginning will be sufficient.

EwoutH commented 2 months ago

Awesome, I will implement your suggestions later today.

EwoutH commented 2 months ago

Implemented and tested!

EwoutH commented 2 months ago

I don't really get the test error. It seems a value diff this time?

toruseo commented 2 months ago

This seems to be a real issue.

This means that your method says the traffic volume was 9310, whereas the actual number of all vehicles was 6880. Maybe double-counting or something.

Pls verify your method very carefully. Traffic volume and other stats in time_bin mode need to be verified as well. And can you include your test code (based on a small, tractable scenario like one in test_area_stats()) as well?

EwoutH commented 2 months ago

A huge advantage of calculating traffic_volume based on Link data, is that you can turn off vehicle data logging completely and still get those 6 metrics. That can save an order of magnitude in memory usage in both running the simulation and saving data from it afterwards.

I will try to get the count deviation bug fixed this week.

EwoutH commented 2 months ago

@toruseo I fixed the remaining issues (see 2dca14f for some details), all tests now pass!

toruseo commented 1 month ago

can you add test for time_bin mode as I instructed?

Validity of Analyzer must be very carefully checked, because if its output is corrupted, entire results of the research will be seriously damaged.

EwoutH commented 1 month ago

Of course, I understand the importance of tests. I just wanted to check first if you were happy about the implementation before moving forward.

I fixed a nasty bug in which the last time bin was excluded sometimes in f543c34 and added tests in ced235e.

toruseo commented 1 month ago

sorry, I missed your last message. Let me check

toruseo commented 1 month ago

strange. There may be some bugs. I used the following code for testing in a tractable small scale scenario.

from uxsim import *

W = World(
    name="",
    deltan=5, 
    tmax=2000, 
    print_mode=1, save_mode=1, show_mode=1,
    random_seed=0
)

W.addNode("orig", 0, 0)
W.addNode("mid1", 1, 1)
W.addNode("mid2", 2, 2)
W.addNode("dest", 3, 3)
link1 = W.addLink("link1", "orig", "mid1", length=1000, free_flow_speed=20, jam_density=0.2)
link2 = W.addLink("link2", "mid1", "mid2", length=1000, free_flow_speed=20, jam_density=0.2, capacity_out=0.4)
link3 = W.addLink("link3", "mid2", "dest", length=1000, free_flow_speed=20, jam_density=0.2)
t1 = 400
t2 = 800
W.adddemand("orig", "dest", 100, t1, 0.8)
W.adddemand("orig", "dest", t1, t2, 0.4)
W.adddemand("orig", "dest", t2, 1000, 0.1)

W.exec_simulation()

W.analyzer.print_simple_stats()

df = W.analyzer.link_to_pandas(time_bin=500)
display(df)

print(sum(df["traffic_volume"][df["link"]=="link1"]), sum(df["traffic_volume"][df["link"]=="link2"]), sum(df["traffic_volume"][df["link"]=="link3"]))

df2 = W.analyzer.area_to_pandas([["orig","mid1","mid2","dest"]])
display(df2)
print(df2["traffic_volume"])

The total number of vehicles is 420, so print(sum(df["traffic_volume"][df["link"]=="link1"]), sum(df["traffic_volume"][df["link"]=="link2"]), sum(df["traffic_volume"][df["link"]=="link3"])) should print 420 420 420, and print(df2["traffic_volume"]) should print 420. But I got the following results:

Can you check?

Edit: BTW, in the main branch version, W.analyzer.area_to_pandas([["orig","mid1","mid2","dest"]]) return the following dataframe. Thus, this function is very broken.

EwoutH commented 1 month ago

Thanks for getting back.

Unfortunately, I don’t have time to work on this in the near future. If you are able, feel free to continue work on this PR.

My gut tells me it might be either A) some time indexing bug, like not including the first or last time step, B) something to do with when the vehicles get counted on a link, or C) something to do with how the areas are split and links the edges behaving weird,

toruseo commented 1 month ago

okay, I leave this in pending status

toruseo commented 2 weeks ago

This is related to https://github.com/toruseo/UXsim/issues/143#issuecomment-2473026855 But this is too good to close, so I keep it for future updates.

EwoutH commented 2 weeks ago

I don't see me working further on it, but I have successfully used this code in my thesis. So feel free to continue it.

toruseo commented 2 weeks ago

The current error may be small, but I value perfection for this kind of stats computation. I also dont see me working on this, so let me close this one.

toruseo / UXsim