govCMS / GovCMS7

Current stable release of the main Drupal 7 GovCMS distribution, with releases mirrored at https://www.drupal.org/project/govcms
https://www.govcms.gov.au/
GNU General Public License v2.0
113 stars 76 forks source link

The File Usage count is wrong for file entity files #277

Open kurtfoster opened 8 years ago

kurtfoster commented 8 years ago

There is a bug with File Entity / Media // Workbench moderation around the file usage count. Currently the number is unusable for operations such as a sort on content/files.

There are a couple of relevant issues in these modules currently: https://www.drupal.org/node/2353491 https://www.drupal.org/node/2024619

Podgkin commented 8 years ago

Just adding my voice. We definately have this issue, and it makes it really hard to manage file lifecycles. If you can't sort by usage, then you can't get a list of unused files to start deleting them. Since each revision adds a use count, it is possible to end up with files used on one page having a usage count of several hundred. This means that Iin many cases the list of usage locations is unusable. This creates skepticism about the reliability of the content managment, and resistance to removing files, which in turn has a broader impact on the platform by expanding storage requirements. We have done significant work to overcome this problem, but it is ultimately a platform limitation.

fiasco commented 8 years ago

Thanks for logging the bug. Definately sounds like something that we'll want to address. If you're happy to provide a PR, we'll be happy to review.

We've put this on the roadmap and if the PR reaches us in time, hopefully we can get it into the 2.5 release.

Feng-Shui commented 8 years ago

I think this is actually useful. Knowing that a file is linked in previous versions of content makes a lot of sense if you need to revert revisions at some point and the files needs to be available again. If the file isn't counted in the previous revisions, this could result in it being marked as not in use and it could then be removed. Reverting the content could then result in a 404, or (of more concern in some Department's I've worked in who have files as an integral part of the page content) an "incorrect" revert.

I think what would be more useful is two counts, one for usage in current published revisions, and one for usage in published revisions (not a revision count, a count of the nodes themselves).

Thoughts?

Podgkin commented 8 years ago

@Feng-Shui Showing both is a good idea.

That said, the problem isn't just that it shows usage in previous revisions, that usage number is also very often incorrect. It is possible to create a new node with no file, then create a revision with a file, and have the usage count read as "2".

The problem lies in the order of events around node loading in the workbench moderation module. It should set the original to the current before updating, otherwise node_load returns the draft as the original. This order of events can result in incorrect data being saved, so theoretically this could extend beyond just incorrect file usage information, resulting in information being lost forever when a revision is updated.

While we can apply one of the patches linked by Kurt, It might actually be worth some deeper investigating into how much information can be lost, so we can determine if information actually has been lost.

kurtfoster commented 8 years ago

@Feng-Shui , @Podgkin , i wouldn't suggest applying one of those patches, I agree that this needs more work than just that simple action. They were added only for extra context / background on the issue.

Feng-Shui commented 8 years ago

@Podgkin ahh yeah, I'm picturing the code now, it does all that footwork around node_load for the current and past revisions, I remember that being a bit of a nightmare when working in this space previously.

I wasn't aware that the count was also incorrect even taking account of non-published revisions. I'm happy to put some time into this if we get a direction, we're going to need to sort this one as some stage too.

mstrelan commented 8 years ago

To add to this, it would be nice to include files that have been referenced in content, via LinkIt, etc. I came across this a while ago and ended up with a custom implementation, but the revisions / use count is still very confusing. I think this needs to be "fixed" in the file_entity module or another contrib module, not specifically govcms.

xtfer commented 7 years ago

@mstrelan It's not really possible to add "files linked in content" in the use count.

Podgkin commented 7 years ago

@xtfer It is possible, it is just very difficult, unreliable and hideously resource intensive.

We have been working on this problem pretty extensively, we've checked, rechecked and cross referenced all fields on all of our content. We think we have found nearly all of the links, but it is very difficult to actually prove we've found all of them.

Salsa wrote us a script which runs a query across every relevant field on every node on every relevant content type. It identifies links, follows those links through any redirects to their destination, checks if the destination is in our domain and checks if the destination is a file or a fieldable file entity. It then generates a csv report of the findings, which we cross referenced against the file usage count, and both internal and multiple external link checkers. It is a pretty nifty piece of kit, but it really strains the server since its basically saying "query ALL the things!" and it still isn't infallable, since we found some discrepencies between it and the other link check methods.

As a result I have been thinking for a while about the perfect world way to programaticaly resolve this problem. I think we would need something similar to the above script to run when ever a content revision is saved, check each relevant field on that node for a valid link to a valid file, and then it would need to check the file ID and update the file usage count. This would slow down content save times, but we could just add it to the cron queue to defer the server load and counteract the preceived user experience impact.

The fix to the current bug in the usage count will help, however @mstrelan is right that ultimately the file usage count is always going to be incorrect unless we check all fields.

Podgkin commented 7 years ago

I have an idea I'd like to run past the community.

If you create a view which filters body fields by "contains", the results it returns are based on the raw markdown of that field, not on the rendered field. In this way it is possible to generate a view of all instance of a link within body content.

If you create a view like this for each of your text, link or file fields, you would then be able to see all references to the chosen text from any field. If the "contains" value is passed from the current URL, then the view could show all refences to the current node or file from within all body fields. If you set those views to all show the delta results, you would then have an accurate file usage count.

I can see two problems:

  1. The performance impact is still large, since your views are querying a fair chunk of your database. This could be offset somewhat by making it admin only and enabling/disabling the view, so that it isn't always running.
  2. This will only check for one term at a time, so it would not count any alliases or redirects. This could be gotten around by implementing the solution directly in PHP as a module.

Still, even without going the whole hog and writing a new module, a single one of these views still allows you to check if a given file, word or node is in use in a given field, and it could be implemented in any SaaS environment without deploying any custom code.

In addition to file usage counts, the same approach could be taken to finding mentions of an old minister, or references to a given policy, which is the use case behind https://github.com/govCMS/govCMS/issues/151

mpigott commented 7 years ago

@Podgkin great idea about the view. I think it solved my particular problem.