mozilla / overscripted

Repository for the Mozilla Overscripted Data Mining Challenge
Mozilla Public License 2.0

Analyses issue #22 and TLD [WIP] #98

Open Soumya0803 opened 5 years ago

Soumya0803 commented 5 years ago

@birdsarah I have submitted an initial analysis on #22 and small analyses on the TLDs. I will add more to this. In the TLD folder, tld_analysis is the main notebook, in which the others are linked. Please review the work done so far.

birdsarah commented 5 years ago

Your issue_22 notebook has a merge conflict and I had to manually edit it to get it to run.

Soumya0803 commented 5 years ago

Thanks @birdsarah. I will work on all the things you mentioned. I'll get into the practice of keeping my notebook clean by not adding large amounts of data. About local storage: as I mentioned, it is a WIP. The local storage values are what I planned to understand next and find some meaning in.

"This cookie is used to determine and save whether the chat widget is open for future visits" and follow-on claims. Very interesting! But how did you know all this? It's not evident from the data.

I found this on their website, where the cookies being used are listed. I'll try to look further and find evidence for it in the data.

Soumya0803 commented 5 years ago

Instead of transcribing you could use counter.most_common(10)

To help dask you can do dff.script_netloc.apply(get_end_of_net_loc, meta='O'). 'O' is the object type, which is what is available in pandas for strings.

Thanks for mentioning these, I will make these changes.
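For reference, the counter.most_common(10) suggestion could look roughly like the sketch below. The hostnames and the last_label helper are made up for illustration and are not taken from the notebook; the real analysis would read script_netloc values from the crawl data.

```python
from collections import Counter

# Hypothetical sample of script hostnames (not real crawl data).
netlocs = [
    "cdn.example.com",
    "static.example.com",
    "tracker.example.net",
    "ads.example.com",
    "widgets.example.org",
    "api.example.net",
]


def last_label(netloc):
    """Return the text after the final dot of a hostname (illustrative helper)."""
    return netloc.rsplit(".", 1)[-1]


# Count how often each ending appears.
tld_counts = Counter(last_label(n) for n in netlocs)

# most_common(k) returns the k most frequent (value, count) pairs,
# sorted by count in descending order, so nothing needs to be transcribed by hand.
top = tld_counts.most_common(10)
for tld, count in top:
    print(tld, count)
```

Asking for more entries than exist is fine: most_common(10) simply returns every entry when there are fewer than ten distinct values.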

To answer your question "Is the script contributing to fingerprinting every time it is called, or are there specific instances?" I would say the answer is yes because you've used fairly precise heuristics to generate those lists and are more likely to have missed some candidates than got too many false positives.

Thank you for answering this. I'll update it in the notebook.

Overall, really great work. Thanks a lot.

I will work more on issue #22, as the value column has a lot more information and I'll have to dig deeper. I'll add what more I'm planning to work on at the end of the notebook to indicate how I'm going to accomplish things.

birdsarah commented 5 years ago

I'll add what more I'm planning to work on at the end of the notebook to indicate how I'm going to accomplish things.

I look forward to that. I'm eager to see your response / thoughts on this question:

You are counting by number of calls, what does that tell you? What are potential biases with these numbers? Would a metric like number of scripts change things?
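To make the question concrete, here is a minimal sketch of how the two metrics can disagree. The rows, the script URLs, and the symbol names are invented for illustration and are not taken from the dataset or the notebook.

```python
from collections import Counter

# Hypothetical (script_url, symbol) rows, one per recorded JavaScript call.
calls = [
    ("https://a.example/fp.js", "window.localStorage"),
    ("https://a.example/fp.js", "window.localStorage"),
    ("https://a.example/fp.js", "window.localStorage"),
    ("https://a.example/fp.js", "window.localStorage"),
    ("https://b.example/app.js", "window.document.cookie"),
    ("https://c.example/chat.js", "window.document.cookie"),
    ("https://d.example/ads.js", "window.document.cookie"),
]

# Metric 1: number of calls per symbol. A single chatty script can dominate.
calls_per_symbol = Counter(symbol for _, symbol in calls)

# Metric 2: number of distinct scripts per symbol. Deduplicate the
# (script, symbol) pairs first so each script counts once per symbol.
scripts_per_symbol = Counter(symbol for _, symbol in set(calls))

print("by calls:  ", calls_per_symbol.most_common())
print("by scripts:", scripts_per_symbol.most_common())
```

In this toy data, window.localStorage leads by call count only because one script calls it repeatedly, while window.document.cookie is used by more distinct scripts. That is the kind of bias a per-call count can hide.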

aliamcami commented 4 years ago

Hi @Soumya0803, is this ready for review?

aliamcami commented 4 years ago

Closing this PR due to lack of activity, please feel free to reopen.

Soumya0803 commented 4 years ago

Hi @aliamcami, I have worked on some of the points mentioned in the review. I look forward to continuing to work on this PR.

birdsarah commented 4 years ago

Thanks @Soumya0803, I'm sorry for the stagnation. We'll take a look at this and your other PR.