stac-utils / stac-index

A service that lists all available and registered STAC catalogs and APIs.
https://stacindex.org
Apache License 2.0

Refactor checkDuplicates function #34

Closed zacdezgeo closed 1 year ago

zacdezgeo commented 1 year ago

Description

This PR modifies the checkDuplicates function to clean the URL and title strings by removing whitespace, underscores, and hyphens before calculating the Levenshtein distance. Additionally, the function now flags only exact duplicates (Levenshtein distance = 0).

Steps for Testing:

Preliminary: Insert pystac into Database

  1. Database Setup: Ensure that your local database is set up and modify the connection in common.js to match your database specs.
  2. Insert Sample Data: Insert a sample record (pystac) into your public.ecosystem table:

    INSERT INTO public.ecosystem (
        url, title, summary, categories, email, extensions, api_extensions, created, updated, language
    )
    VALUES (
        'https://github.com/stac-utils/pystac', 'PySTAC', 'Sample Summary', ARRAY['Data Creation', 'Validation'], NULL, ARRAY['eo'], '{}', '2022-01-01T00:00:00.000Z', '2022-01-01T00:00:00.000Z', 'Python'
    );

Step 1: Confirm Successful Insert of pgstac

  1. Run the Server: Navigate to the server directory and execute npm install followed by npm run dev.
  2. Test with curl: Run the following curl command to insert the new pgstac record:

    curl -X POST 'http://localhost:9999/add' \
    -H 'Content-Type: application/json' \
    -d '{
        "type": "ecosystem",
        "url": "https://github.com/stac-utils/pgstac",
        "slug": "pgstac",
        "title": "PgSTAC",
        ...
    }'

    This should successfully insert the record without flagging it as a duplicate.

Step 2: Validate Cleaning Function

  1. Test with Modified curl Command: Run the following curl command with a modified URL and title:

    curl -X POST 'http://localhost:9999/add' \
    -H 'Content-Type: application/json' \
    -d '{
        "type": "ecosystem",
        "url": "https://github.com/stac-utils/p_g-stac",
        "slug": "p_g-stac",
        "title": "P g-STAC",
        ...
    }'

    The URL and title intentionally contain extra whitespace, underscores, and hyphens. The function should strip these characters and flag the record as a duplicate of the existing pgstac entry.

  2. Check for Error: If the cleaning function is working correctly, this curl command should trigger a duplicate error.
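
For reviewers who want to reason about the expected results without running the server, the cleaning and exact-match check described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the names cleanString and isDuplicate are hypothetical, and it assumes that the comparison is case-sensitive and that a match on either the URL or the title counts as a duplicate.

```javascript
// Hypothetical helper (not from the PR): strip whitespace, underscores,
// and hyphens so that cosmetic variations compare as equal.
function cleanString(value) {
  return String(value).replace(/[\s_-]+/g, '');
}

// Hypothetical check: with a required Levenshtein distance of 0, duplicate
// detection reduces to strict equality of the cleaned strings.
// Assumption: a match on either the URL or the title is a duplicate.
function isDuplicate(candidate, existingRecords) {
  const url = cleanString(candidate.url);
  const title = cleanString(candidate.title);
  return existingRecords.some(record =>
    cleanString(record.url) === url || cleanString(record.title) === title
  );
}

// Records from the testing steps above.
const pystac = { url: 'https://github.com/stac-utils/pystac', title: 'PySTAC' };
const pgstac = { url: 'https://github.com/stac-utils/pgstac', title: 'PgSTAC' };
const modified = { url: 'https://github.com/stac-utils/p_g-stac', title: 'P g-STAC' };

// Step 1: pgstac differs from pystac after cleaning, so it is accepted.
console.log(isDuplicate(pgstac, [pystac])); // false

// Step 2: the modified record cleans to the same strings as pgstac.
console.log(cleanString(modified.title)); // PgSTAC
console.log(isDuplicate(modified, [pystac, pgstac])); // true
```

Because only a distance of 0 is accepted, strict equality is sufficient here, which is consistent with the later request in this thread to drop the Levenshtein dependency.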

zacdezgeo commented 1 year ago

Fixes #33

m-mohr commented 1 year ago

Thanks for the PR, left one comment above.

zacdezgeo commented 1 year ago

@m-mohr, let me know if you want any other changes and if I can help with deployment. I'll submit pgstac afterwards.

m-mohr commented 1 year ago

Thanks, please also remove the Levenshtein import and dependency; I think it's not used anywhere else. Afterwards this is ready for deployment. I'll take care of deployment, thanks.

zacdezgeo commented 1 year ago

Ah, I should've caught that! Just fixed @m-mohr. Thanks!

m-mohr commented 1 year ago

Thanks, you did not yet remove the dependency from the package.json, right?

zacdezgeo commented 1 year ago

Still getting used to node development... removed dependency!

m-mohr commented 1 year ago

Thanks, merged. I'll redeploy in the next few days and let you know once completed.

m-mohr commented 1 year ago

@zacharyDez I just redeployed STAC Index. You should now be able to submit your missing entries.