[FEAT]: Privacy concern: Scraping enabled by default - should be disabled

AquaVirus commented 1 month ago

Description

Do not have "Automatically get info for new videos" activated on install.

Additional information

There is no good reason to have "Automatically get info for new videos" activated on install. Sending information about every media file on your phone to a third party without prior consent is extremely invasive (even spyware-like) behavior, especially considering how questionable that third party's privacy policy is (there is no limit on how long they keep API data, and they basically get to do whatever they want with it). I don't know how such a glaring issue has never been pointed out before, and if left unaddressed, it's probably enough grounds to have the app taken off F-Droid.

Pentaphon commented 1 month ago

Sending information about every media file on your phone to a third party without prior consent is extremely invasive

But it doesn't do that. It scrapes data for every file that matches a title on TMDB. Your media files are safe and no info is actually sent about them because "VID00004.mp4" is not identifying data nor is that data stored by TMDB.

However, your idea to opt-in rather than opt out with private mode is worth considering.

AquaVirus commented 1 month ago

So, in your example, is "VID00004.mp4" not sent to TMDB at all? It's true that video file names do not contain sensitive information most of the time, but they can contain sensitive information, and that must be accounted for.

mschumacher69 commented 1 month ago

So, in your example, is "VID00004.mp4" not sent to TMDB at all?

No, it just queries the file name for a match, that's it. The data is neither stored nor identifiable, it's completely anonymous.

It's similar to searching for the title on Google or a torrent website.

AquaVirus commented 1 month ago

But that is sending the video's name to TMDB, just as searching for the title on Google would send the title to its servers. The data is linked to your IP, and there is no reason to believe TMDB doesn't keep the data - in fact their privacy policy makes it explicit that they log API usage data. And if you still feel like disagreeing, here's an excerpt directly from Nova Player's privacy.md:

a processed version of the video file names present on the device is sent to themoviedb.org servers

You can see the pre-processing code here: https://github.com/nova-video-player/aos-MediaLib/tree/v6.2/src/com/archos/mediascraper/preprocess

The point of the processing is just to make scraping more accurate; there is no effort made to ensure the user's privacy. There is a way to scrape while minimizing privacy concerns: keep a reduced, local copy of TMDB that file names can be matched against, and only query the TMDB API if a match is found. No file names would ever be shared with TMDB that way.

mschumacher69 commented 1 month ago

Yes it just sends the title, not the whole file. Your comment made it seem like it sends the whole file or detailed info about the file, it doesn't. It just extracts the title from the file name and queries this title with TMDB.

I'm not sure how feasible it would be to cache the whole TMDB database offline as you suggested, but I imagine it would take a lot of time and storage to cache everything, because it would need to cache posters and backdrops as well. It would also need to update this cache occasionally to stay up to date.

AquaVirus commented 1 month ago

There's no need to keep copies of images - those could be queried using TMDB's API once a match has been found against the local database. A list of movie titles would probably be enough, and it wouldn't take a lot of storage.

mschumacher69 commented 1 month ago

Well, if images are going to be queried using TMDB's API once a match has been found, you might as well just query TMDB's API directly, which is what happens right now.

I mean using your logic, unless most of your video files do not exist on TMDB, TMDB's API is gonna be queried every time to download posters, which would defeat the whole purpose of caching the database for privacy...

AquaVirus commented 1 month ago

No, it wouldn't be the same as querying TMDB with file names directly. Once a match has been found against the local database, a query would be made using the movie's name, not the file name. TMDB still gets to know what movies you have on your device, yes, but that's unavoidable. I also believe that, for the majority of people, most video files are not present on TMDB.

mschumacher69 commented 1 month ago

What difference does it make if it queries using the movie name or the file name? They are one and the same...

I rarely come across a video not available on the TMDB and when I do, I add it myself with a few clicks...

AquaVirus commented 1 month ago

Try to keep up with me here:

Let's say you have two media files on your device: one is called "[1080p]The.Godfather[HDRip].mkv", the other "356694391.mkv". After preprocessing, the file names become "The Godfather" and "356694391", respectively. What the app currently does is query the TMDB API for both names. That works fine for the first file, but for the second it just feeds needless data to TMDB. In my example, the file name does not contain any information one would prefer to keep private, but the point is that it could.

With my proposed solution, the app would first query the processed file names against the local DB. After finding a match for "The Godfather", it would query the TMDB API to retrieve information about the movie. It wouldn't find a match for the other file, however, and no data about the file would be sent to TMDB's server in this case.
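
For illustration, the local-first lookup described above could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not Nova's code: `clean_title` is a hypothetical, heavily simplified stand-in for the real preprocessing, `local_titles` stands in for the reduced local copy of TMDB titles, and `titles_to_query` is a made-up helper name.

```python
import re

# Hypothetical, simplified preprocessing (NOT Nova's actual rules):
# strip bracketed tags such as [1080p] and turn separators into spaces.
_TAGS = re.compile(r"\[[^\]]*\]|\([^)]*\)")
_SEPARATORS = re.compile(r"[._\-]+")


def clean_title(file_name: str) -> str:
    """Return a normalized, lower-case title candidate for a file name."""
    stem = file_name.rsplit(".", 1)[0]      # drop the extension, if any
    stem = _TAGS.sub(" ", stem)
    stem = _SEPARATORS.sub(" ", stem)
    return " ".join(stem.split()).lower()


def titles_to_query(file_names, local_titles):
    """Return only the titles that matched the local list.

    File names without a local match are dropped here, so nothing
    derived from them would be passed on to a remote API.
    """
    known = {title.lower(): title for title in local_titles}
    matches = []
    for name in file_names:
        title = known.get(clean_title(name))
        if title is not None:
            matches.append(title)
    return matches


# Assumed sample data: a tiny local title list and the two files from the example.
local_titles = ["The Godfather", "Casablanca"]
files = ["[1080p]The.Godfather[HDRip].mkv", "356694391.mkv"]
print(titles_to_query(files, local_titles))  # ['The Godfather']
```

Only "The Godfather" would go on to an API request in this sketch; "356694391" never leaves the device. Exact matching is used for simplicity, so a real implementation would likely need fuzzier matching against the local list.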

mschumacher69 commented 1 month ago

I see how this can be a problem when using nova on a phone, the reason I didn't realize that is because I use it on my TV where it only indexes the smb shares that I select, there are no local files that it auto indexes.

Anyway, there seems to be a way to de-index local folders as per this reddit post from 5 years ago, but I can't seem to find that option on my phone.

Another way is to disable auto indexing completely under preferences > automatically get info for new videos.

AquaVirus commented 1 month ago

Yes, it can be disabled, but the indexing is enabled by default. So, by the time you disable it, it's already too late and your data has already been harvested.

Pentaphon commented 1 month ago

your data has already been harvested.

Lol what data? Even if you were having your data "harvested", what is the risk of having your auto generated VID0004.mp4 video titles matched with those on TMDB? Your personal video titles are not going to show anything personally identifiable. I do agree with you that @courville could simply not have Nova scrape your videos until you want it to but you are acting as if you are being datamined.

mschumacher69 commented 1 month ago

Exactly, I mean unless your videos are titled sextapewithmywife.mp4 or something like that, why do you care if vid1.mp4 gets sent to tmdb?

AquaVirus commented 1 month ago

Exactly, I mean unless your videos are titled sextapewithmywife.mp4 or something like that, why do you care if vid1.mp4 gets sent to tmdb?

Would you still be OK with it if Nova was uploading every file on your device (not just file names) to a third party? I mean, most files don't contain anything particularly interesting, so we could apply the same argument you're making. The fact that a video's file name is unlikely to contain personal information does not justify Nova's careless handling of it in any way, especially with TMDB being a literal data harvesting operation. Have you looked at their Privacy Policy? It's disgusting. They actively censor criticism of it on their forums, too. If you still want to argue that it's no big deal, then I'd like to see you list the names of every media file on every device you own right here. I could show you how much information can be inferred from supposedly innocuous file names alone.

mschumacher69 commented 1 month ago

Again, nova is not uploading the file itself, it just parses the filename and uploads the title that it extracts from the filename. That filename has got no personal info and I don't care if it gets uploaded to TMDB.

Media files currently on my phone are screen-xxxxxx.mp4, VID-20240125-WAxxxx.mp4, PXL_20231105_xxxx.mp4, I mean I'm not gonna list all of them, but they are all like that and those filenames are meaningless...

courville commented 1 month ago

@AquaVirus you are correct, we could disable automatic scanning and indexing of internal storage, which would avoid exposing pre-processed video filenames to TMDB. However, the impact on users in terms of feature discoverability and user-friendliness of media indexing would be significant. Nova is designed to process video libraries, and its benefit is to provide a turnkey, out-of-the-box experience. This has been the choice since the beginning, with the drawback you identified.

Pentaphon commented 1 month ago

Would you still be OK with it if Nova was uploading every file on your device (not just file names) to a third party?

That is simply not happening though yet you are acting like this is the case with Nova.

However, the impact on users in terms of feature discoverability and user-friendliness of media indexing would be significant. Nova is designed to process video libraries, and its benefit is to provide a turnkey, out-of-the-box experience.

@courville while I do think AquaVirus is fearing something that they shouldn't fear, I do think it is worth exploring ways to make scraping a manual process by default. That way, people who don't care to scrape all their device's files can skip the process when installing Nova, and TMDB API usage isn't wasted on people who don't want to scrape TMDB anyway.

Perhaps a long term goal could be to

a) By default, only scrape the filenames in folders that the user manually tells Nova to add to the library.

b) Upon install, have the user decide if they want to run Nova in private mode or regular mode rather than just do regular mode by default.

I think solution A would be easiest to implement.
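
As a rough illustration of how options (a) and (b) might combine, here is a minimal Python sketch of opt-in defaults. Everything in it is an assumption made for the example: `ScrapeSettings`, `scraping_enabled`, `library_folders`, and `may_scrape` are hypothetical names and do not correspond to Nova's actual preferences or code.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class ScrapeSettings:
    # Option (b): scraping stays off until the user explicitly opts in.
    scraping_enabled: bool = False
    # Option (a): only folders the user has added are eligible.
    library_folders: set = field(default_factory=set)


def may_scrape(path: str, settings: ScrapeSettings) -> bool:
    """Return True only if the user opted in AND the file is in an added folder."""
    if not settings.scraping_enabled:
        return False
    parents = PurePosixPath(path).parents
    return any(PurePosixPath(folder) in parents
               for folder in settings.library_folders)


# Fresh install: nothing is eligible for scraping.
settings = ScrapeSettings()
print(may_scrape("/storage/emulated/0/DCIM/VID00004.mp4", settings))  # False

# The user opts in and adds a single folder (hypothetical paths).
settings.scraping_enabled = True
settings.library_folders.add("/storage/emulated/0/Movies")
print(may_scrape("/storage/emulated/0/Movies/The.Godfather.mkv", settings))  # True
print(may_scrape("/storage/emulated/0/DCIM/VID00004.mp4", settings))         # False
```

With defaults like these, no file name would be eligible for a lookup until the user has both enabled scraping and chosen at least one folder.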

AquaVirus commented 1 month ago

you are correct, we could disable automatic scanning and indexing of internal storage which would avoid exposing pre-processed video filenames to TMDB.

Yes, that would suffice.

That is simply not happening though yet you are acting like this is the case with Nova.

I'm not claiming that that's the case for Nova - it's just a what-if for the sake of argument.

a) By default, only scrape the filenames in folders that the user manually tells Nova to add to the library.

b) Upon install, have the user decide if they want to run Nova in private mode or regular mode rather than just do regular mode by default.

Both solid solutions, although solution A doesn't seem any more discoverable than just keeping indexing disabled by default.

mschumacher69 commented 1 month ago

I think just adding an option to select which local folders to add to the library (similar to the case with smb shares) instead of scanning all the local folders would solve this problem.

nova-video-player / aos-AVP

[FEAT]: Privacy concern: Scraping enabled by default - should be disabled #1198
