microbiomedata / nmdc-metadata

Managing metadata and policy around metadata in NMDC
https://microbiomedata.github.io/nmdc-schema/

Add missing data_objects #234

Open jbeezley opened 3 years ago

jbeezley commented 3 years ago

There are several data objects missing from the mongo instance that are referenced by other objects. None of these will break anything on the portal, but they will be missing from search results.

The following data_objects are referenced by omics_processing:

igsn:IEWFS000A
igsn:IEWFS000B
igsn:IEWFS000I
igsn:IEWFS000J
igsn:IEWFS000K
igsn:IEWFS0019
igsn:IEWFS001A
igsn:IEWFS001B
igsn:IEWFS0001
igsn:IEWFS0003
igsn:IEWFS0005
igsn:IEWFS0006
igsn:IEWFS0007
igsn:IEWFS0009
igsn:IEWFS000D
igsn:IEWFS000E
igsn:IEWFS000F
igsn:IEWFS000G
igsn:IEWFS000L
igsn:IEWFS000N
igsn:IEWFS000O
igsn:IEWFS000P
igsn:IEWFS000R
igsn:IEWFS000S
igsn:IEWFS000U
igsn:IEWFS000V
igsn:IEWFS000X
igsn:IEWFS000Z
igsn:IEWFS0012
igsn:IEWFS0015
igsn:IEWFS0016
igsn:IEWFS0018
igsn:IEWFS001C
igsn:IEWFS001D
igsn:IEWFS001E
igsn:IEWFS001F
igsn:IEWFS001G
igsn:IEWFS001H
igsn:IEWFS0002
igsn:IEWFS0004
igsn:IEWFS0008
igsn:IEWFS000C
igsn:IEWFS000H
igsn:IEWFS000Q
igsn:IEWFS000W
igsn:IEWFS000Y
igsn:IEWFS0010
igsn:IEWFS0011
igsn:IEWFS0014
igsn:IEWFS0017
igsn:IEWFS0013

And these are referenced by metagenome annotation:

nmdc:7e0bb15dc62ea4a5ae94f51af347129f
nmdc:6a455b07be6e9c6b3f0631858a8ade17
nmdc:376cc399590f368eaf5a486087750077
nmdc:fddf8cd12a559ba5c1dc7749ea6ffadb
nmdc:6f83c763978b8cfccd1dbc3c1fff4976
nmdc:fcc5bd82615fab43bb54006632862521
nmdc:4b9b0f82bf50950ecf8b77d24a141565
nmdc:8c96c0ddd734b2d7c3355a85fb478727
nmdc:369be15c709557b137c6ad6994ced3e4
nmdc:1f02bef68a0a71ecbb325dfdbff6ae85
nmdc:44c38f11b62931b22a3e38e44e12a99b
nmdc:8f6ed75bd49fa03502adbd7ec7c55a09
nmdc:fd4e4871caef5352801a0d58b2fc5727
nmdc:eb216db5f15b5982f60cb2c5f0f82b97
nmdc:b8f11f271313ebce716edfe8a9650118
nmdc:8ef63c54cbfef72c19733e48ad0d1961
nmdc:18b8afd638a9bb80a633f138150b7edd
nmdc:753162645bb37b64f3ef9c0b2ca8a935
nmdc:3d2cc3c5ba651c5f92302ee5c1c0d36b
nmdc:35280ee871ca08e52849a30f18c497b0
nmdc:c8d6693287398701b91c4d194856d0f3
nmdc:03c49b063726126a4526bc96c3f03078
nmdc:0114daea61986c6cf6290657d1fa8ee0
nmdc:157f619acb8d1af497dbd311bb0129e0
nmdc:7455b386b754925ce055f8f585f6242c
nmdc:8ec141e339b4bedc49fbef7236422bec
nmdc:565eb354afc5db4ed502bb6dece91d03
nmdc:88c54f9cd321218cfecdd844c999f402
nmdc:957600e89955173435e9b35666e3f1d5
nmdc:957a5665e44143eea0c3a99b5665a51d
nmdc:690271181735467ff2f978d804ce4fee
nmdc:af403f9e2180b4ee5f0d536f6130a50d
nmdc:61f06a2309788ded26b1fdec53ca3791
nmdc:ee594da0ca22271208e72ed9480b9878
nmdc:f0bcb32cfb78fa6abb5be2ac7bd48284
nmdc:d16c1df1b6eca2a3997c93250673c58d
nmdc:e4cf91ffe121a58186e6f123f117e0c2
nmdc:e7e7582466f2c7d419e1d03f0b529879
nmdc:80e2cf75fff69f6111ea7a738fe68eac
nmdc:18589f426c24e06ff58610dcd48f3bd9
nmdc:59e944a5bd686bb1d85d9ad06356854a
nmdc:bb14b03eb4a4e30fcea4b9faa98c08e0

@dwinston
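
The script that produced the two lists above is not shown in this thread. As a rough illustration of the kind of check involved, here is a minimal sketch that finds referenced IDs with no matching document. The function name `find_dangling_refs`, the `ex:` IDs, and the toy documents are all hypothetical; only the `has_input` field name comes from the discussion.

```python
from typing import Iterable, Mapping, Set


def find_dangling_refs(
    referencing_docs: Iterable[Mapping],
    known_ids: Iterable[str],
    ref_field: str = "has_input",
) -> Set[str]:
    """Return every ID found in `ref_field` that is absent from `known_ids`."""
    known = set(known_ids)
    referenced = set()
    for doc in referencing_docs:
        value = doc.get(ref_field, [])
        # Accept either a single ID string or a list of IDs.
        referenced.update([value] if isinstance(value, str) else value)
    return referenced - known


# Hypothetical toy data, not taken from the real database.
activities = [
    {"id": "ex:act1", "has_input": ["ex:do1"]},
    {"id": "ex:act2", "has_input": ["ex:do2", "ex:do3"]},
]
data_object_ids = ["ex:do1", "ex:do3"]

missing = find_dangling_refs(activities, data_object_ids)
print(sorted(missing))  # ['ex:do2']
```

Working on plain dicts keeps the check independent of the database driver, so the same function could be fed documents from any collection that carries a reference field.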

wdduncan commented 3 years ago

@jbeezley I wanted to check on the status of this ticket. Can I close it?

jbeezley commented 3 years ago

I don't see them in mongo yet.

dwinston commented 3 years ago

@jbeezley can you clarify the exact nature of the omics_processing issue? It seems that those ids are biosample ids rather than data object ids, and referenced from the has_input field, i.e.

import os

from dotenv import load_dotenv
load_dotenv(os.path.expanduser("~/.nmdc_mongo.env"))

from nmdc_mongo import get_db

db_share = get_db("dwinston_share")

ids_ops = """
igsn:IEWFS000A
igsn:IEWFS000B
igsn:IEWFS000I
igsn:IEWFS000J
igsn:IEWFS000K
igsn:IEWFS0019
igsn:IEWFS001A
igsn:IEWFS001B
igsn:IEWFS0001
igsn:IEWFS0003
igsn:IEWFS0005
igsn:IEWFS0006
igsn:IEWFS0007
igsn:IEWFS0009
igsn:IEWFS000D
igsn:IEWFS000E
igsn:IEWFS000F
igsn:IEWFS000G
igsn:IEWFS000L
igsn:IEWFS000N
igsn:IEWFS000O
igsn:IEWFS000P
igsn:IEWFS000R
igsn:IEWFS000S
igsn:IEWFS000U
igsn:IEWFS000V
igsn:IEWFS000X
igsn:IEWFS000Z
igsn:IEWFS0012
igsn:IEWFS0015
igsn:IEWFS0016
igsn:IEWFS0018
igsn:IEWFS001C
igsn:IEWFS001D
igsn:IEWFS001E
igsn:IEWFS001F
igsn:IEWFS001G
igsn:IEWFS001H
igsn:IEWFS0002
igsn:IEWFS0004
igsn:IEWFS0008
igsn:IEWFS000C
igsn:IEWFS000H
igsn:IEWFS000Q
igsn:IEWFS000W
igsn:IEWFS000Y
igsn:IEWFS0010
igsn:IEWFS0011
igsn:IEWFS0014
igsn:IEWFS0017
igsn:IEWFS0013
""".strip().splitlines()

len(ids_ops) # 51
db_share.biosample_set.count_documents({"id": {"$in": ids_ops}}) # 51
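
When it is unclear which kind of entity an ID refers to, as with the igsn IDs here, one option is to look each ID up in several collections at once. The following is only a sketch under assumptions: `locate_ids` is a hypothetical helper, the `ex:` IDs are made up, and the per-collection ID sets are hard-coded here (against the real database they could be built with `distinct("id")`, as is done elsewhere in this thread).

```python
from typing import Dict, Iterable, List, Set


def locate_ids(
    ids: Iterable[str], collections: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each ID to the names of the collections that contain it."""
    return {
        id_: [name for name, members in collections.items() if id_ in members]
        for id_ in ids
    }


# Hypothetical toy data standing in for the per-collection ID sets.
collections = {
    "biosample_set": {"ex:sample1", "ex:sample2"},
    "data_object_set": {"ex:do1"},
}

located = locate_ids(["ex:sample1", "ex:do1", "ex:unknown"], collections)
for id_, names in located.items():
    print(id_, "->", names or "not found")
```

An ID that maps to an empty list is genuinely missing, while one that maps to an unexpected collection suggests the reference is being checked against the wrong set.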

As for the metagenome annotation activity set docs, it does seem that the has_input field references data objects, and the 42 you list are missing data objects:

ids_mga = """
nmdc:7e0bb15dc62ea4a5ae94f51af347129f
nmdc:6a455b07be6e9c6b3f0631858a8ade17
nmdc:376cc399590f368eaf5a486087750077
nmdc:fddf8cd12a559ba5c1dc7749ea6ffadb
nmdc:6f83c763978b8cfccd1dbc3c1fff4976
nmdc:fcc5bd82615fab43bb54006632862521
nmdc:4b9b0f82bf50950ecf8b77d24a141565
nmdc:8c96c0ddd734b2d7c3355a85fb478727
nmdc:369be15c709557b137c6ad6994ced3e4
nmdc:1f02bef68a0a71ecbb325dfdbff6ae85
nmdc:44c38f11b62931b22a3e38e44e12a99b
nmdc:8f6ed75bd49fa03502adbd7ec7c55a09
nmdc:fd4e4871caef5352801a0d58b2fc5727
nmdc:eb216db5f15b5982f60cb2c5f0f82b97
nmdc:b8f11f271313ebce716edfe8a9650118
nmdc:8ef63c54cbfef72c19733e48ad0d1961
nmdc:18b8afd638a9bb80a633f138150b7edd
nmdc:753162645bb37b64f3ef9c0b2ca8a935
nmdc:3d2cc3c5ba651c5f92302ee5c1c0d36b
nmdc:35280ee871ca08e52849a30f18c497b0
nmdc:c8d6693287398701b91c4d194856d0f3
nmdc:03c49b063726126a4526bc96c3f03078
nmdc:0114daea61986c6cf6290657d1fa8ee0
nmdc:157f619acb8d1af497dbd311bb0129e0
nmdc:7455b386b754925ce055f8f585f6242c
nmdc:8ec141e339b4bedc49fbef7236422bec
nmdc:565eb354afc5db4ed502bb6dece91d03
nmdc:88c54f9cd321218cfecdd844c999f402
nmdc:957600e89955173435e9b35666e3f1d5
nmdc:957a5665e44143eea0c3a99b5665a51d
nmdc:690271181735467ff2f978d804ce4fee
nmdc:af403f9e2180b4ee5f0d536f6130a50d
nmdc:61f06a2309788ded26b1fdec53ca3791
nmdc:ee594da0ca22271208e72ed9480b9878
nmdc:f0bcb32cfb78fa6abb5be2ac7bd48284
nmdc:d16c1df1b6eca2a3997c93250673c58d
nmdc:e4cf91ffe121a58186e6f123f117e0c2
nmdc:e7e7582466f2c7d419e1d03f0b529879
nmdc:80e2cf75fff69f6111ea7a738fe68eac
nmdc:18589f426c24e06ff58610dcd48f3bd9
nmdc:59e944a5bd686bb1d85d9ad06356854a
nmdc:bb14b03eb4a4e30fcea4b9faa98c08e0
""".strip().splitlines()

db_share.metagenome_annotation_activity_set.count_documents({"has_input": {"$in": ids_mga}}) # 42

ids_ds = set(db_share.data_object_set.distinct("id"))
ids_mga_all = set(db_share.metagenome_annotation_activity_set.distinct("has_input"))
len(ids_mga_all - ids_ds) # 42

I don't know why they are missing. @dehays any ideas here?

jbeezley commented 3 years ago

Yes, it appears that the omics_processing issue has been resolved. Running the script again, I now see two more missing data objects in the metaP collection:

nmdc:7bfe2f3c086105ffe665317a21af38d3
nmdc:ff5f339ebacb8f723d133f3c2daff1bf

dehays commented 3 years ago

@jbeezley @dwinston If I understand correctly, the omics_processing (projects in Jon's schema) references to Brodie biosamples (igsn ID biosamples) are no longer an issue, meaning those biosample docs are now there.

But there are still 42 metaG annotation and 2 metaP analysis that have has_input references to data objects that are not present.

My idea on why they are not present is that they were not included in the provided data object JSON. Donny, can you grab the IDs for the metaG annotation and metaP analysis docs that reference the data object IDs above? I can then follow up with Shane and Sam.
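
One way to test that hypothesis would be to look for the missing IDs directly in the provided JSON. The sketch below assumes the JSON is a flat array of records that each carry an `id` key; the actual layout of the provided file is not described in this thread, and `ids_absent_from_json` and the `ex:` IDs are hypothetical.

```python
import json
from typing import Iterable, List


def ids_absent_from_json(json_text: str, wanted_ids: Iterable[str]) -> List[str]:
    """Return the wanted IDs that no record in the JSON array carries."""
    records = json.loads(json_text)
    # Assumption: each record is an object with an "id" key.
    present = {rec["id"] for rec in records if "id" in rec}
    return [id_ for id_ in wanted_ids if id_ not in present]


# Hypothetical stand-in for the contents of the provided data object JSON.
provided = '[{"id": "ex:do1", "name": "reads.fastq"}, {"id": "ex:do2"}]'

absent = ids_absent_from_json(provided, ["ex:do1", "ex:do9"])
print(absent)  # ['ex:do9']
```

If every missing ID comes back as absent, the gap is upstream of the load step; if some are present in the JSON, the load itself would be the place to look.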

dwinston commented 3 years ago

Here you go, @dehays. The 2 missing metaP data objects are referenced by 33 metaP analysis docs. I also noted that each metaP analysis doc has three entries in its "has_input" array.

docs_metaG = list(
    db_share.metagenome_annotation_activity_set.find({
        "has_input": {"$in": ids_mga}}, ["has_input", "id"]
    ))
print(len(docs_metaG), "affected docs")

for doc in docs_metaG:
    print("metaG ID", doc["id"])
    print("missing data_object ID", doc["has_input"][0])

42 affected docs
metaG ID nmdc:c7e6625c228fb16c512a0ceefd10fdcf
missing data_object ID nmdc:3d2cc3c5ba651c5f92302ee5c1c0d36b
metaG ID nmdc:6dfdf838817c96138022176ec33de297
missing data_object ID nmdc:369be15c709557b137c6ad6994ced3e4
metaG ID nmdc:11592ec20682d5bc349b293ff6d61f9e
missing data_object ID nmdc:8ef63c54cbfef72c19733e48ad0d1961
metaG ID nmdc:1993a481f92d491d0550ae7c97233164
missing data_object ID nmdc:8ec141e339b4bedc49fbef7236422bec
metaG ID nmdc:a2eaeceb2f0a6b07083bdffd24e5f713
missing data_object ID nmdc:35280ee871ca08e52849a30f18c497b0
metaG ID nmdc:07ee3c5a879a27d082ee3e6f3518ca1b
missing data_object ID nmdc:8f6ed75bd49fa03502adbd7ec7c55a09
metaG ID nmdc:8c59b25253e4a63f0731b74909855e43
missing data_object ID nmdc:af403f9e2180b4ee5f0d536f6130a50d
metaG ID nmdc:a96c1578090a15845e5920f52dc01a44
missing data_object ID nmdc:957a5665e44143eea0c3a99b5665a51d
metaG ID nmdc:34472e64c7f249a6e1f2b0f9445b89d6
missing data_object ID nmdc:fcc5bd82615fab43bb54006632862521
metaG ID nmdc:ff1fe327ab9d07ae8affd29e3dbef16c
missing data_object ID nmdc:c8d6693287398701b91c4d194856d0f3
metaG ID nmdc:567e35cfffbc1e0ebc1e2c781ce726e3
missing data_object ID nmdc:fddf8cd12a559ba5c1dc7749ea6ffadb
metaG ID nmdc:e1952aad2afebadbac8eb462b6a84d2b
missing data_object ID nmdc:7455b386b754925ce055f8f585f6242c
metaG ID nmdc:647b163bdcb9528b8dcc8ab9b506c957
missing data_object ID nmdc:7e0bb15dc62ea4a5ae94f51af347129f
metaG ID nmdc:5c7d758a1ae4be3debf95de04fc0e50b
missing data_object ID nmdc:18b8afd638a9bb80a633f138150b7edd
metaG ID nmdc:7a119c050961e0618c731187dba892a5
missing data_object ID nmdc:88c54f9cd321218cfecdd844c999f402
metaG ID nmdc:dbce99ec57e05f3ff39af0223e3dbfee
missing data_object ID nmdc:61f06a2309788ded26b1fdec53ca3791
metaG ID nmdc:0e9fb1e720caf88f61ab8bc4d866af7e
missing data_object ID nmdc:6a455b07be6e9c6b3f0631858a8ade17
metaG ID nmdc:a1447564a93994425b47569f4110dce3
missing data_object ID nmdc:b8f11f271313ebce716edfe8a9650118
metaG ID nmdc:f19f53cba21dcecfd7c8fc0feac2277c
missing data_object ID nmdc:957600e89955173435e9b35666e3f1d5
metaG ID nmdc:ce0717d5f24e153fe33a42fec43b026b
missing data_object ID nmdc:8c96c0ddd734b2d7c3355a85fb478727
metaG ID nmdc:63bb43ea7f71db994f4c21b5d2ca7c3e
missing data_object ID nmdc:fd4e4871caef5352801a0d58b2fc5727
metaG ID nmdc:6aa6cb8058330eddc3c4ffe2418f96e2
missing data_object ID nmdc:ee594da0ca22271208e72ed9480b9878
metaG ID nmdc:51880e32d7564747a46f00854e071d32
missing data_object ID nmdc:6f83c763978b8cfccd1dbc3c1fff4976
metaG ID nmdc:530593ca6433c5a14a82f66f99837e84
missing data_object ID nmdc:03c49b063726126a4526bc96c3f03078
metaG ID nmdc:cf6b096e809a5e4c4718da5553686651
missing data_object ID nmdc:753162645bb37b64f3ef9c0b2ca8a935
metaG ID nmdc:adef6b4b4874adb135620a21f27a53b6
missing data_object ID nmdc:1f02bef68a0a71ecbb325dfdbff6ae85
metaG ID nmdc:f863d392fcda8b1873b80168a1d672ba
missing data_object ID nmdc:157f619acb8d1af497dbd311bb0129e0
metaG ID nmdc:4113345e630c2a4c9a3d535982a3480b
missing data_object ID nmdc:44c38f11b62931b22a3e38e44e12a99b
metaG ID nmdc:c64ec801a3898b29fa68eb83a23f18c9
missing data_object ID nmdc:eb216db5f15b5982f60cb2c5f0f82b97
metaG ID nmdc:b8f5351b114af34d5357646dbef04478
missing data_object ID nmdc:690271181735467ff2f978d804ce4fee
metaG ID nmdc:7ef7c51072a8fa8151571ab602d54277
missing data_object ID nmdc:0114daea61986c6cf6290657d1fa8ee0
metaG ID nmdc:200022c0672deff53cda040fee54b9e6
missing data_object ID nmdc:376cc399590f368eaf5a486087750077
metaG ID nmdc:c53e4c651cfd13ad8183925a92d7023a
missing data_object ID nmdc:565eb354afc5db4ed502bb6dece91d03
metaG ID nmdc:da77b0888e64217118bddd9ca88a5797
missing data_object ID nmdc:4b9b0f82bf50950ecf8b77d24a141565
metaG ID nmdc:994588e22bb440eefab12f51e8db6544
missing data_object ID nmdc:e4cf91ffe121a58186e6f123f117e0c2
metaG ID nmdc:18261bfa0823d43b13f74b931c13a1df
missing data_object ID nmdc:18589f426c24e06ff58610dcd48f3bd9
metaG ID nmdc:8cd14e5ca612bfe923e2d8f54da25fec
missing data_object ID nmdc:f0bcb32cfb78fa6abb5be2ac7bd48284
metaG ID nmdc:5b0dba88801b500203d5b763984251b7
missing data_object ID nmdc:e7e7582466f2c7d419e1d03f0b529879
metaG ID nmdc:3937b209cad2769633a309e8bd646a09
missing data_object ID nmdc:bb14b03eb4a4e30fcea4b9faa98c08e0
metaG ID nmdc:6ace961767fd864cda21d485bf5711a8
missing data_object ID nmdc:59e944a5bd686bb1d85d9ad06356854a
metaG ID nmdc:65ff4bf5e258bcbf8d9e0d956a32988a
missing data_object ID nmdc:d16c1df1b6eca2a3997c93250673c58d
metaG ID nmdc:50e40947cb4664d4b4163ab37f3d4103
missing data_object ID nmdc:80e2cf75fff69f6111ea7a738fe68eac

and

ids_metaP = """
nmdc:7bfe2f3c086105ffe665317a21af38d3
nmdc:ff5f339ebacb8f723d133f3c2daff1bf
""".strip().splitlines()

docs_metaP = list(
    db_share.metaproteomics_analysis_activity_set.find({
        "has_input": {"$in": ids_metaP}}, ["has_input", "id"]
    ))
print(len(docs_metaP), "affected docs")

for doc in docs_metaP:
    print("metaP ID", doc["id"])
    print("missing data_object ID", next(inp for inp in doc["has_input"] if inp in ids_metaP))

33 affected docs
metaP ID nmdc:e642e3d734849753562e09e7ec3c9caa
missing data_object ID nmdc:ff5f339ebacb8f723d133f3c2daff1bf
metaP ID nmdc:1bfd6e40ac02f766c6c45581e696ddf1
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:5041575072fa5dc7b7e4f42f95584968
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:8aa240e26ad9c9bc7fc77bb48d2fb0da
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:e7164af2295a144b0855d02643fb0cd9
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:81d5da75e9a8e8637f7aab3e4ff70f24
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:77e34889d1be229f3aff74d0b449d4e1
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:c254a4df8d62db7d128bb96a02011381
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:54c3eb3d3cec3f142f39bd5efff05c2f
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:9ce22398cc92e5c969769afaf611b137
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:1ec84045ca630dc2a5695a9a8e92f985
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:98ad100ae227a58340b72d74adacaea0
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:8ad20466b614a1c2fdfa8d66dcf86e38
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:b1ab4b501e9c5c70e593b384283f2270
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:e094cf2a26b4f2a95d27c480abe81717
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:ccba2d4d4c04fab0a8424de4c35f1645
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:174d4e2e75f34a25bc02b6bf02ddd01c
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:e2bf7a7250f26990dbda7c648d7a8cab
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:cb00d652d71e9a926acaaad2e70f0cfe
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:da3b4b60eabac7f3788e734bebae7f50
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:f60f4a40e7cb8fe91c6462c56b3ad0f7
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:a2daaa3a2fe532754062f275d0ddcf7c
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:777c6c3a1770492f0c7094ee813be6ab
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:b232dc4ec98e8304ed77e59dadbacab6
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:b68c2721cdae7f00fe6c3e4400fb0fbc
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:4a8df3938084893addbaf51464148c78
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:5032ce3634219b35e2c6d0a3ce84c0da
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:82ab2b5905d172a7afb023997258daf7
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:9fe6ce2a57be10eb6dcaa1cba462d905
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:4ca1a1f561f0eaeb1cb40a54eee6be9c
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:9ccef749be4e638942d0931511bdced8
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:018bc0905284928015d7ff11b4d073d1
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
metaP ID nmdc:ab0ad22df5f89ae652174f5d189305d5
missing data_object ID nmdc:7bfe2f3c086105ffe665317a21af38d3
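
In the metaP listing above, one doc reports nmdc:ff5f339ebacb8f723d133f3c2daff1bf and the other 32 report nmdc:7bfe2f3c086105ffe665317a21af38d3 (the loop prints only the first missing input it finds per doc). Since each metaP doc has several `has_input` entries, a per-ID tally may be easier to read than the full listing. A minimal sketch, assuming `has_input` is always a list; `count_docs_per_missing_id` and the `ex:` IDs are hypothetical:

```python
from collections import Counter
from typing import Iterable, Mapping


def count_docs_per_missing_id(
    docs: Iterable[Mapping],
    missing_ids: Iterable[str],
    ref_field: str = "has_input",
) -> Counter:
    """Count how many docs reference each missing ID."""
    missing = set(missing_ids)
    counts: Counter = Counter()
    for doc in docs:
        # A doc can list several inputs; count each missing ID once per doc.
        for id_ in missing.intersection(doc.get(ref_field, [])):
            counts[id_] += 1
    return counts


# Hypothetical toy docs, each with three inputs like the metaP docs.
docs = [
    {"id": "ex:a1", "has_input": ["ex:raw1", "ex:db1", "ex:params"]},
    {"id": "ex:a2", "has_input": ["ex:raw2", "ex:db1", "ex:params"]},
    {"id": "ex:a3", "has_input": ["ex:raw3", "ex:db2", "ex:params"]},
]

counts = count_docs_per_missing_id(docs, ["ex:db1", "ex:db2"])
print(counts.most_common())  # [('ex:db1', 2), ('ex:db2', 1)]
```

Unlike printing the first match per doc, this tally would also surface a doc that references more than one missing data object.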