uvacw / inca


Softcosine days #464

Closed mariekevh closed 5 years ago

mariekevh commented 5 years ago

Softcosine_similarity now has include_monday parameter.

I tried several things, including CustomBusinessDays, to merge Saturday and Sunday, but none of them worked with the way we set up our softcosine code. I think this is only useful in cases like our news event project, where you have a 3-day sliding window and want to include articles published on Monday when the window starts on a Friday (because not much is published on Sundays). That is why I did it as follows:

if days_after == 2 and source_date_object.weekday() == 4:
    days_after = days_after + 1
# create lists of relevant targets per source

Let me know if there is a better solution...
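For reference, the rule above can be sketched as a small standalone function. This is only an illustration: the helper name `effective_days_after` and its `include_monday` default are assumptions, and only `days_after`, `weekday()` and the Friday check come from the snippet itself.

```python
import datetime


def effective_days_after(source_date, days_after, include_monday=True):
    """Return the window length in days for one source document.

    Hypothetical helper that mirrors the rule in the snippet above: a window
    of 2 days after a Friday is stretched to 3 days so that it reaches Monday.
    """
    # date.weekday(): Monday is 0, Friday is 4, Sunday is 6
    if include_monday and days_after == 2 and source_date.weekday() == 4:
        return days_after + 1
    return days_after


friday = datetime.date(2018, 12, 7)
thursday = datetime.date(2018, 12, 6)
print(effective_days_after(friday, 2))    # 3
print(effective_days_after(thursday, 2))  # 2
```

Note that the rule only fires for `days_after == 2`; any other window length is returned unchanged.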

Additionally, I found a mistake in the calculation of day_diff. It is now target date minus source date (e.g. '2018-12-07' - '2018-12-08' = -1, as it should be). How this mistake was still in here is a mystery to me...
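The sign convention can be checked with plain `datetime.date` objects. The variable names here are illustrative and not taken from the inca code:

```python
import datetime

source_date = datetime.date(2018, 12, 8)
target_date = datetime.date(2018, 12, 7)

# target minus source: negative when the target was published before the source
day_diff = (target_date - source_date).days
print(day_diff)  # -1
```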

FeLoe commented 5 years ago

I have not checked your solution yet (but you are right about the day_diff; I think I originally implemented it the other way around) - but I had another thought: would it not make our lives a lot easier to sort all the results by date, group them by date, and then compare each group to the next two "groups"? This way we could just have something like "ignore Sundays" (so weekday = 6) and throw out every group that is a Sunday. Otherwise, if you have days_after = 3 and also want to exclude Sundays from the analysis, that is not possible. I am not sure how much work it would be to implement that (and if it is too complicated we should just leave it), but it might be a little cleaner than how it is now.
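A rough sketch of what that grouping could look like, assuming each document is a plain dict with hypothetical `id` and `date` keys (the real inca documents are nested differently) and a hypothetical function name `compare_next_groups`:

```python
import datetime
from collections import defaultdict


def compare_next_groups(docs, n_groups=2, ignore_weekdays=(6,)):
    """Pair every document with the documents in the next n date groups.

    Hypothetical sketch: documents whose weekday is in ignore_weekdays
    (6 is Sunday) are dropped before grouping.
    """
    grouped = defaultdict(list)
    for doc in docs:
        if doc["date"].weekday() not in ignore_weekdays:
            grouped[doc["date"]].append(doc)

    dates = sorted(grouped)
    pairs = []
    for i, date in enumerate(dates):
        # the next n groups, which are not necessarily the next n calendar days
        for later in dates[i + 1:i + 1 + n_groups]:
            for source in grouped[date]:
                for target in grouped[later]:
                    pairs.append((source["id"], target["id"]))
    return pairs


docs = [
    {"id": "fri", "date": datetime.date(2018, 12, 7)},
    {"id": "sat", "date": datetime.date(2018, 12, 8)},
    {"id": "sun", "date": datetime.date(2018, 12, 9)},
    {"id": "mon", "date": datetime.date(2018, 12, 10)},
]
print(compare_next_groups(docs))
# [('fri', 'sat'), ('fri', 'mon'), ('sat', 'mon')]
```

With the Sunday group thrown out, the Friday documents reach Monday without any special case for `days_after`.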

mariekevh commented 5 years ago

Discussed with Felicia today: I am going to change the way we compare dates, which will make everything a lot easier. I will look at this tomorrow. I'll let you guys know when this PR is ready :)

mariekevh commented 5 years ago

@FeLoe I need a little help.

So the idea was to group the docs by date, then compare a group to the next n groups (so dates, such as Saturday and Sunday, can easily be merged). For this to be possible, source and target need to be in the same list of dicts. For now I did the following:

            # convert sourcedate and targetdate into datetime.date objects if not already
            for a in source_query:
                if not isinstance(a['_source'][sourcedate], datetime.date):
                    # keep only the leading 'YYYY-MM-DD' part and split it into integers
                    year, month, day = [int(i) for i in a['_source'][sourcedate][:10].split("-")]
                    a['_source'][sourcedate] = datetime.date(year, month, day)

            for a in target_query:
                if not isinstance(a['_source'][targetdate], datetime.date):
                    # parse the target's own date field
                    year, month, day = [int(i) for i in a['_source'][targetdate][:10].split("-")]
                    a['_source'][targetdate] = datetime.date(year, month, day)

            # sort source_query and target_query by date
            source_query.sort(key = lambda item:item['_source'][sourcedate])
            target_query.sort(key = lambda item:item['_source'][targetdate])

            # group by date
            source_grouped = defaultdict(list)
            for i in source_query:
                source_grouped[i['_source'][sourcedate]].append(i)
            target_grouped = defaultdict(list)
            for i in target_query:
                target_grouped[i['_source'][targetdate]].append(i)

            grouped = defaultdict(list)
            for d in (source_grouped, target_grouped):
                for key, value in d.items():
                    grouped[key].append(value)

However, if a date is missing from either source or target, grouped will have only one list (instead of two) for that particular date, which makes it impossible to see whether that list is source or target... Any ideas how to fix this?

This also raises a question: if the source and target corpora are missing certain dates, the next 2 groups might not actually be the next two days. It is likely that some days will be missing from the corpus if you only take into account articles that contain certain keywords. This makes me think it might not be such a good idea after all. Thoughts?
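A tiny made-up example of the problem (the dates are invented for illustration): if the corpus only has articles on 3, 4 and 10 December, the "next two groups" after 3 December are one and seven days away.

```python
import datetime

# dates that actually occur in a hypothetical keyword-filtered corpus
dates_in_corpus = sorted([
    datetime.date(2018, 12, 3),
    datetime.date(2018, 12, 4),
    datetime.date(2018, 12, 10),
])

first = dates_in_corpus[0]
next_two_groups = dates_in_corpus[1:3]

# distance in calendar days from the first date to each of the next two groups
gaps = [(d - first).days for d in next_two_groups]
print(gaps)  # [1, 7]
```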

FeLoe commented 5 years ago

I did not have much time to look at it yet, but one option might be to add an extra key to the dicts that says whether it's a source or a target, to preserve that information. And if you make a list of all dates like here: https://stackoverflow.com/questions/993358/creating-a-range-of-dates-in-python and use each date as a key to an empty list, you could extend the value with the documents for that day, or keep the empty list if nothing exists for that day. Then, if you compare those days, nothing is compared. Tell me if that sounds too confusing.
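Roughly something like the following, as a sketch only. The function name `group_by_calendar_day`, the `role` key and the flat `date` field are made up for illustration, and both lists together are assumed to contain at least one document:

```python
import datetime


def group_by_calendar_day(source_docs, target_docs, datefield="date"):
    """Group documents per calendar day, keeping empty days and the doc's role.

    Hypothetical sketch: every day between the earliest and latest date gets
    a key, so a missing day is an empty list instead of a missing group.
    """
    all_docs = source_docs + target_docs
    first = min(doc[datefield] for doc in all_docs)
    last = max(doc[datefield] for doc in all_docs)
    n_days = (last - first).days + 1

    grouped = {first + datetime.timedelta(days=i): [] for i in range(n_days)}
    for role, docs in (("source", source_docs), ("target", target_docs)):
        for doc in docs:
            # copy the doc and tag it, so source/target is never lost
            grouped[doc[datefield]].append({**doc, "role": role})
    return grouped


sources = [{"id": "s1", "date": datetime.date(2018, 12, 3)}]
targets = [{"id": "t1", "date": datetime.date(2018, 12, 5)}]
grouped = group_by_calendar_day(sources, targets)
for day, docs in grouped.items():
    print(day, [(d["id"], d["role"]) for d in docs])
# 2018-12-03 [('s1', 'source')]
# 2018-12-04 []
# 2018-12-05 [('t1', 'target')]
```

Because 4 December is present as an empty list, "the next two groups" are again the next two calendar days.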

mariekevh commented 5 years ago

@FeLoe Yes, that is a good idea. Thanks. But honestly I think it might not be a good idea to compare groups instead of dates because of what I said before. Let me know :)

damian0604 commented 5 years ago

Hi both, thanks for all of this! I'm not sure if I completely get the last points of your discussion, but I'm happy to help if that's necessary...

mariekevh commented 5 years ago

I think this is ready! 👍

damian0604 commented 5 years ago

Perfect, will look into it ASAP. So the idea is that if we use windows we create the pajek-file separately? I guess that's fine, we wanted to remove double edges anyway...

mariekevh commented 5 years ago

Yes, I thought creating double the amount of files is a bit chaotic; it's probably better to do it at once afterwards :)