kaito1410 / AutoDD_Rev2

An improved version of the original AutoDD
GNU General Public License v3.0
167 stars 50 forks source link

undercounting of symbols? #25

Closed gobbedy closed 3 years ago

gobbedy commented 3 years ago

I just went through your code, I love it and will try my hand at using it tomorrow.

I just found what seems to be an undercounting of tickers:

        # title_extracted is a set, duplicate tickers from the same title counted once only
        for k in title_extracted:

            if k in all_dict:
                all_dict[k] += increment
            else:
                all_dict[k] = increment

        # avoid counting additional point for the tickers found in the text body
        # only search the text body if ticker was not found in the title
        if len(title_extracted) > 0:
                continue

        for m in selftext_extracted:

            if m in all_dict:
                all_dict[m] += increment
            else:
                all_dict[m] = increment

Doesn't the above miss out on instances when a ticker appears in the selftext but not in the title? (eg say the title contains AAPL and the selftext contains MSFT, it would skip MSFT).

Since rocket_tickers is already a set, shouldn't the above be replaced by the following?

        for k in rocket_tickers:

            if k in all_dict:
                all_dict[k] += increment
            else:
                all_dict[k] = increment

Thanks again for sharing this code.

kaito1410 commented 3 years ago

Doesn't the above miss out on instances when a ticker appears in the selftext but not in the title?

I think the idea is that most people that are posting about a ticker would have it in the title. If the ticker was not found in the title, then we check the selftext. If someone mentions one ticker in the title and another in the body, the code assumes the intention is talking about the ticker that's in the title. If there's a trend of putting a ticker in the title and a different ticker in the body, maybe we can consider counting the ticker in the body.

Since rocket_tickers is already a set, shouldn't the above be replaced by the following?

` for k in rocket_tickers:

        if k in all_dict:
            all_dict[k] += increment
        else:
            all_dict[k] = increment`

Not sure what you mean here, for each ticker that appears in the title + selftext (rocket_tickers), we add some score to all_dict? The rocket count is seperate from the score (all_dict), so not sure what you mean exactly.

gobbedy commented 3 years ago

Ok, thanks for explaining the assumptions in the code. Based on those assumptions, the code is correct.

As for explaining my code, if the intention of the code had instead been to add truly "all" the stock tickers to all_dict (that is, all stock tickers present in either the title or the selftext), then my suggested code would have been correct. (As a reminder of that particular bit of the code, rocket_tickers is a set containing all the tickers in both the title and in the selftext)

The name all_dict lead me to believe the latter was the intention, but your assumptions are perfectly reasonable. Either way, I guess it would rarely make a difference (very few instances where a relevant ticker is mentioned in the selftext when only a different ticker is mentioned in the title)

Since my suggestion actually simplifies the code and makes all_dict truly contain all tickers, it might be worth considering. But like I said your assumptions are reasonable and it probably makes no practical difference anyway.

kaito1410 commented 3 years ago

I agree, I think you could argue for both ways. The rockets count towards both set of title and selftext ticker is a bit of a toss up. I think by the same logic maybe I should only count the rockets towards the title ticker or if no title ticker exist then count them towards the selftext ticker? I'm not 100% sure if this is better or not...

gobbedy commented 3 years ago

Hey kaito, thanks for fixing/closing the other issues!

Regarding this one, actually I have a stupid question. What is a rocket ticker? I understand from the code what it counts (the total number of times a ticker appears in any title or selftext), but what value does it add above the "all_dict" count? And why is it called "rocket"?

This is not a cheeky question meant as a criticism, I'm genuinely wondering. Is it perhaps meant to catch early appearances of a ticker before the reddit community has had time to upvote the relevant submissions?

Also, is it intentional that if a ticker appears multiple times in a title/selftext, it will be be counted multiple times in the rocket_ticker count?

gobbedy commented 3 years ago

closing as the new ticket is more relevant