attackerSteamIDs in DF['damages'] rounding error if there is any World or C4 damage done

Siiggyy commented 1 year ago

If you create a DataFrame with parse_json_to_df, df['damages'] sometimes has a rounding error in attackerSteamID when any damage was done by the world or by C4, because the attackerSteamID for those events is None. When the DataFrame is then built from that data, the Steam IDs get messed up.

The issue happens on line 594 of demoparser.py. A fix could be to give world damage a custom attackerSteamID. https://github.com/pnxenopoulos/awpy/blob/ccd9c34366bda0424bf04d3e73a12f22059333c5/awpy/parser/demoparser.py#L594

                for d in r["damages"]:
                    # World and C4 damage have no attacker, so use 0 as a placeholder ID.
                    if d["weapon"] in ("World", "C4"):
                        d["attackerSteamID"] = 0
                    new_d = d
                    new_d["roundNum"] = r["roundNum"]
                    new_d["matchID"] = self.json["matchID"]
                    new_d["mapName"] = self.json["mapName"]
                    damages.append(new_d)

That is how I currently have it implemented: World or C4 damage gets the attackerSteamID 0.
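
A more general version of this workaround could be sketched as a small helper that fills in a placeholder wherever a Steam ID is missing, instead of checking the weapon name. The function name `fill_missing_steam_ids`, its parameters, and the sample events are hypothetical and not part of awpy; only the key names come from the JSON.

```python
from __future__ import annotations

from typing import Any


def fill_missing_steam_ids(
    events: list[dict[str, Any]],
    keys: tuple[str, ...] = ("attackerSteamID",),
    placeholder: int = 0,
) -> list[dict[str, Any]]:
    """Return copies of the events with missing (None) Steam IDs replaced."""
    filled = []
    for event in events:
        new_event = dict(event)  # shallow copy so the parsed JSON stays untouched
        for key in keys:
            # A key that is absent is treated the same as an explicit None.
            if new_event.get(key) is None:
                new_event[key] = placeholder
        filled.append(new_event)
    return filled


# Hypothetical sample data shaped like the entries of r["damages"].
sample_damages = [
    {"attackerSteamID": 76561198133319162, "weapon": "AK-47"},
    {"attackerSteamID": None, "weapon": "World"},
    {"attackerSteamID": None, "weapon": "C4"},
]
filled_damages = fill_missing_steam_ids(sample_damages)
```

Because the check is on the missing ID rather than the weapon, the same helper would also cover other event lists where the attacker can be None.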

Added my JSON for testing purposes. JSON.zip

sirh3e commented 1 year ago

@Siiggyy I've got the same problem here OwO

JanEricNitschke commented 1 year ago

What exactly is the output you are getting vs. the output you are expecting? Edit: understood it.

import json
import logging

from awpy.parser.demoparser import DemoParser

demo_parser = DemoParser()
with open(r"D:\CSGO\ML\csgoml\2023-01-20.json", encoding="utf-8") as demo_json:
    demo_data = json.load(demo_json)
demo_parser.json = demo_data
# Steam IDs as they appear after the DataFrame conversion
dataframe = demo_parser.parse_json_to_df()
steam_ids_df = set(dataframe["damages"]["attackerSteamID"].unique())
logging.info(steam_ids_df)

# Steam IDs read directly from the parsed JSON, for comparison
steam_ids = set()
if demo_parser.json:
    for r in demo_parser.json["gameRounds"] or []:
        if r["damages"] is not None:
            for d in r["damages"]:
                steam_ids.add(d["attackerSteamID"])
logging.info(steam_ids)

2023-01-20 22:13:35 INFO     {76561198083936288, 76561198201946624, 76561198078944032, 76561198133319168, 76561198397742272, 76561199096388144, 76561198120668208, 76561198262004176, 76561198033174672, 76561198193861488, <NA>}
2023-01-20 22:13:35 INFO     {76561198201946625, 76561198397742267, 76561199096388144, 76561198033174674, 76561198120668211, 76561198262004179, 76561198193861491, 76561198083936281, 76561198133319162, 76561198078944027, None}

But that seems like more than a rounding error. For example, 76561198133319162 becomes 76561198133319168.

I don't think a manual check on this is the way to go, as None values can pop up in multiple places. I'll see if there is a general thing with pandas to handle this.
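
For what it's worth, the mismatches between the two sets look consistent with a round trip through float64, which has a 53-bit significand and so cannot represent every 17-digit integer. A minimal pure-Python sketch (no pandas involved; the two IDs are taken from the log output above, and `float64_round_trip` is just an illustrative name):

```python
import math


def float64_round_trip(value: int) -> int:
    """Convert an int to a Python float (an IEEE 754 double) and back."""
    return int(float(value))


raw_ids = [76561198133319162, 76561198201946625]
round_tripped = [float64_round_trip(v) for v in raw_ids]

# Spacing between neighbouring doubles at this magnitude.
spacing = math.ulp(float(raw_ids[0]))

print(round_tripped)  # [76561198133319168, 76561198201946624]
print(spacing)        # 16.0
```

At this magnitude adjacent doubles are 16 apart, so an ID can shift by up to 8 in either direction, which matches ...162 turning into ...168.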

JanEricNitschke commented 1 year ago

I think I've got a fix that solves this problem. Currently what I get is:

2023-01-21 07:46:50 INFO     Index(['tick', 'seconds', 'clockTime', 'attackerSteamID', 'attackerName',
       'attackerTeam', 'attackerSide', 'attackerX', 'attackerY', 'attackerZ',
       'attackerViewX', 'attackerViewY', 'attackerStrafe', 'victimSteamID',
       'victimName', 'victimTeam', 'victimSide', 'victimX', 'victimY',
       'victimZ', 'victimViewX', 'victimViewY', 'weapon', 'weaponClass',
       'hpDamage', 'hpDamageTaken', 'armorDamage', 'armorDamageTaken',
       'hitGroup', 'isFriendlyFire', 'distance', 'zoomLevel', 'roundNum',
       'matchID', 'mapName'],
      dtype='object')
2023-01-21 07:46:50 INFO     tick                  int64
seconds             float64
clockTime            object
attackerSteamID       Int64
attackerName         object
attackerTeam         object
attackerSide         object
attackerX           float64
attackerY           float64
attackerZ           float64
attackerViewX       float64
attackerViewY       float64
attackerStrafe       object
victimSteamID         Int64
victimName           object
victimTeam           object
victimSide           object
victimX             float64
victimY             float64
victimZ             float64
victimViewX         float64
victimViewY         float64
weapon               object
weaponClass          object
hpDamage              int64
hpDamageTaken         int64
armorDamage           int64
armorDamageTaken      int64
hitGroup             object
isFriendlyFire         bool
distance            float64
zoomLevel           float64
roundNum              int64
matchID              object
mapName              object
dtype: object
2023-01-21 07:46:50 INFO     {76561198083936288, 76561198201946624, 76561198078944032, 76561198133319168, 76561198397742272, 76561199096388144, 76561198120668208, 76561198262004176, 76561198033174672, 76561198193861488, <NA>}
2023-01-21 07:46:50 INFO     {76561198201946625, 76561198397742267, 76561199096388144, 76561198033174674, 76561198120668211, 76561198262004179, 76561198193861491, 76561198083936281, 76561198133319162, 76561198078944027, None}

But if I change https://github.com/pnxenopoulos/awpy/blob/ccd9c34366bda0424bf04d3e73a12f22059333c5/awpy/parser/demoparser.py#L600 to return pd.DataFrame(damages, dtype=object), I get:

2023-01-21 07:48:32 INFO     tick                object
seconds             object
clockTime           object
attackerSteamID      Int64
attackerName        object
attackerTeam        object
attackerSide        object
attackerX           object
attackerY           object
attackerZ           object
attackerViewX       object
attackerViewY       object
attackerStrafe      object
victimSteamID        Int64
victimName          object
victimTeam          object
victimSide          object
victimX             object
victimY             object
victimZ             object
victimViewX         object
victimViewY         object
weapon              object
weaponClass         object
hpDamage            object
hpDamageTaken       object
armorDamage         object
armorDamageTaken    object
hitGroup            object
isFriendlyFire      object
distance            object
zoomLevel           object
roundNum            object
matchID             object
mapName             object
dtype: object
2023-01-21 07:48:32 INFO     {76561198201946625, 76561198397742267, 76561199096388144, 76561198033174674, 76561198120668211, 76561198262004179, 76561198193861491, 76561198083936281, 76561198133319162, 76561198078944027, <NA>}
2023-01-21 07:48:32 INFO     {76561198201946625, 76561198397742267, 76561199096388144, 76561198033174674, 76561198120668211, 76561198262004179, 76561198193861491, 76561198083936281, 76561198133319162, 76561198078944027, None}

So now it doesn't do any weird conversions. However, the dtype of every other column is now object, so anyone relying on the previous dtypes would get thrown off. @pnxenopoulos not sure how to go about this.
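
One way to limit that side effect might be to build the frame with dtype=object and then cast the columns back explicitly. The sketch below is an assumption about how that could look, not the awpy implementation: the two-row sample, the `DAMAGE_DTYPES` mapping, and the `restore_dtypes` helper are hypothetical, and only a few of the columns listed above are included.

```python
from __future__ import annotations

import pandas as pd

# Hypothetical mapping for a handful of the columns listed above.
DAMAGE_DTYPES = {
    "tick": "int64",
    "attackerSteamID": "Int64",  # nullable integer, keeps <NA> for World/C4
    "hpDamage": "int64",
    "distance": "float64",
}


def restore_dtypes(rows: list[dict]) -> pd.DataFrame:
    """Build the frame as object first, then cast each column explicitly."""
    frame = pd.DataFrame(rows, dtype=object)
    return frame.astype(DAMAGE_DTYPES)


sample_rows = [
    {"tick": 100, "attackerSteamID": 76561198133319162, "hpDamage": 27, "distance": 512.5},
    {"tick": 200, "attackerSteamID": None, "hpDamage": 5, "distance": 0.0},
]
damages_df = restore_dtypes(sample_rows)
print(damages_df.dtypes)
```

The downside of this approach is that the mapping has to be kept in sync with the parser output by hand.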

JanEricNitschke commented 1 year ago

I dug a bit deeper and it seems that this is a known issue with pandas. It is actually using the correct nullable integer type, but there is a cast to float going on behind the scenes that causes this issue.

See https://github.com/pandas-dev/pandas/issues/26259 and https://github.com/pandas-dev/pandas/issues/32134 for examples.

It seems the issues were fixed literally 2 days ago in a PR: https://github.com/pandas-dev/pandas/pull/50757.

Sadly, version 1.5.3 was released exactly a day before. So this issue will probably get fixed in the next pandas release, which we should then upgrade to.

So we should probably decide on what the best workaround is until then.

pnxenopoulos commented 1 year ago

We could probably see a new pandas release anywhere from 1 to 3 months from now. To address this issue, we could enforce a steamid of 0, like @Siiggyy does. I believe I might have done this before. Another option is to actually assign a steamid through some logic. For example, world damage could go to the attacker and C4 damage to the bomb planter (not really sure I like this, though; plus, I don't know what causes world damage).

How about for awpy 1.2.3 we change the golang code to return an attacker steamid of 0 in world/c4 damages?

Siiggyy commented 1 year ago

I wouldn't change the damage to go to the attacker or bomb planter, since that would probably screw with a lot of statistics. My guess would be that world damage is something like falling off a building, i.e. fall damage.

JanEricNitschke commented 1 year ago

I also wouldn't try to assign these DMG events to someone.

And I also think it is fine to have no attacker steamid when the DMG is not from a character. I think bots get steamid 0 and that change would make it harder to differentiate.

I feel we should manually set it to 0 for now until there is a new pandas version that includes a fix. At that point I think we should switch back to the current syntax.

@Siiggyy would just have to be aware that he can't rely on the steamid always being an int then.

Siiggyy commented 1 year ago

You could also take another placeholder if bots get steamid 0. And it would be fine if it isn't an int, as long as the ID is correct in the end; then I could still convert it afterwards.

JanEricNitschke commented 1 year ago

I think for a temporary placeholder 0 should be fine, even with the collision with the bots, although -1 (does that work?) or 1 would also be fine and maybe better. I was just referring to xeno's idea to adjust it in awpy in general.

I think the ideal state would be the current one without pandas bugging out.

Siiggyy commented 1 year ago

A similar bug happens with the kills DataFrame. My current guess is that it happens when a player disconnects while alive, since that counts as a death (suicide and teamkill) but the attacker SteamID is NaN. With that we get the same conversion error again.

Temporary fix:

for k in r["kills"]:
    if k["attackerSteamID"] is None:
        k["attackerSteamID"] = 99

JanEricNitschke commented 1 year ago

Pandas 2.0 is out: https://pypi.org/project/pandas/ https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html

Could you check that this now works without issue?

In that case we can just update the requirements.

JanEricNitschke commented 1 year ago

Might need a small change to use arrow types there:

Missing values
Many pandas users must have experienced a data type changing from integer to float implicitly. That's because pandas automatically converts the data type to float when missing values are introduced during calculation or are included in the original data:

In [1]: pd.Series([1, 2, 3, None])
Out[1]:
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64

Missing values have always been a pain in the ass because there are different types for missing values: np.nan is for floating-point numbers, None and np.nan are for object types, and pd.NaT is for date-related types. In pandas 1.0, pd.NA was introduced to avoid type conversion, but it needs to be specified manually by the user. Pandas has always wanted to improve in this part but has struggled to do so.

The introduction of Arrow can solve this problem perfectly:

```
In [1]: df2 = pd.DataFrame({'a':[1,2,3, None]}, dtype='int64[pyarrow]')

In [2]: df2.dtypes
Out[2]:
a    int64[pyarrow]
dtype: object

In [3]: df2
Out[3]:
      a
0     1
1     2
2     3
3  <NA>
```

From here: https://www.reddit.com/r/Python/comments/12b7w3y/everything_you_need_to_know_about_pandas_200/

pnxenopoulos / awpy

attackerSteamIDs in DF['damages'] rounding error if there is any World or C4 damage done #213