sfg-taxonpages / orthoptera

0 stars 0 forks source link

collection objects lost respositories #1

Closed typophyllum closed 9 months ago

typophyllum commented 10 months ago

Apparently repositories so far unknown to TW did not migrate, and the corresponding types and other collection objects are now with "Repository: not specified". Some examples:

African Natural History Research Trust, Hereford, UK (ANHRT) - 326 in total / 26 types (repo added for new specimens) Akdeniz University, Department of Biology, Zoological Museum (AUZM) - 50 in total/42 types Emirates Center for Wildlife Propagation, Morocco (ECWP) - 25/2 Institut National Agronomique de Tunisie (INAT) - 20/0 Universiti Brunei Darussalam Museum (UBDM) - 18/15 Collection B. Massa, University of Palermo (BMPC) - 206/42 C. Hemp Collection - 35 types Heller Collection - 528/36 Museu Anchieta, Porto Alegre (MAPA) - 1 holotype Collection of Sheyla Yong, La Habana, Cuba - 28/20 Brazil, Paraná, Curitiba, Universidade Federal do Paraná, Museu de Entomologia Padre Jesus Santiago Moure (DZUP) - 81/22 Insect collection of Agriculture Science Center, Ghyalchok, Gorkha, Nepal - 13/5 (repo added for new specimens) Annapurna Natural History Museum, Pokhara, Nepal (ANHM) - 6/6 (repo added for new specimens)

I guess we'd need to filter first the types for Repository not specified and then other collection objects to find all cases and think how to solve this.

LocoDelAssembly commented 10 months ago

List of missing repositories

SELECT *
FROM tblDepos D --- get_tw_repo_id.keys.join ", " v
WHERE D.DepoID NOT IN (2257, 76, 1737, 1982, 2371, 3614, 3691, 1669, 1523, 101, 1195, 1540, 1711, 1722, 1814, 1946, 2237, 2331, 2396, 3489, 3585, 664, 999, 2091, 2966, 3569, 203, 1397, 1607, 1634, 2040, 2499, 26, 1043, 1939, 2321, 3218, 3422, 3568, 936, 18, 1356, 1628, 1766, 1933, 2313, 3514, 882, 1526, 1678, 3168, 1608, 934, 3543, 1548, 3220, 1564, 120, 1324, 2008, 2415, 257, 1424, 1713, 2043, 2554, 3354, 3669, 843, 1866, 2234, 2596, 49, 1763, 1919, 1959, 2344, 3426, 3626, 103, 1206, 1419, 1563, 1746, 1784, 1888, 1998, 2398, 3431, 3429, 27, 970, 1211, 1444, 1663, 1699, 1728, 1779, 1792, 1940, 3189, 3383, 3441, 3636, 35, 1358, 1512, 1639, 1688, 1729, 1911, 1945, 2330, 3501, 207, 1565, 1679, 1852, 2049, 2504, 3570, 51, 1896, 3457, 368, 2045, 2667, 3366, 561, 2863, 41, 87, 1268, 1575, 1754, 1799, 1951, 2336, 2401, 3525, 86, 940, 1297, 1464, 1516, 1632, 1690, 1767, 1804, 1857, 1991, 2381, 3185, 3226, 3433, 3546, 3607, 3670, 849, 3723, 44, 1546, 1954, 2339, 3481, 43, 399, 978, 1189, 1456, 1533, 1643, 1676, 1693, 1730, 1778, 1898, 1953, 2338, 2698, 3212, 3225, 3396, 3445, 3527, 3562, 3549, 1559, 3639, 1635, 3504, 3665, 465, 2060, 2765, 2256, 449, 1280, 1800, 2074, 2307, 2749, 82, 1519, 1641, 1987, 2377, 3673, 73, 1336, 1538, 1637, 1717, 1743, 1876, 1979, 2368, 3327, 3506, 8, 1365, 1657, 1701, 1724, 1926, 2303, 1557, 1905, 3432, 1601, 3476, 864, 2265, 62, 1658, 1702, 1733, 1969, 2357, 3688, 10, 1716, 1751, 1841, 1928, 2305, 3460, 67, 78, 954, 1167, 1534, 1640, 1689, 1735, 1880, 1973, 2362, 2373, 3438, 146, 2021, 2441, 3370, 3453, 3645, 939, 3654, 3638, 3642, 3556, 3687, 3603, 1884, 3617, 1826, 3722, 22, 1936, 2317, 3649, 1875, 1579, 854, 2294, 3597, 3621, 353, 2094, 2651, 3579, 2268, 591, 1554, 2893, 3593, 834, 2768, 3464, 3718, 870, 415, 2099, 2714, 823, 2247, 3547, 3667, 821, 2246, 1747, 3169, 3679, 1883, 2296, 3631, 204, 474, 2500, 2774, 871, 911, 3382, 891, 3484, 1602, 1532, 1868, 839, 385, 2098, 2684, 892, 3365, 1677, 147, 1344, 2442, 294, 2591, 3178, 3142, 3707, 1633, 3500, 2270, 1619, 3461, 3463, 3450, 45, 1353, 1760, 1955, 2340, 597, 2899, 124, 2419, 2261, 144, 2019, 2439, 928, 825, 2249, 931, 3416, 3686, 2264, 829, 2250, 867, 2267, 1818, 644, 1307, 1450, 1788, 1795, 1848, 2221, 2946, 3392, 818, 1856, 2243, 2793, 3191, 3663, 1583, 198, 2038, 2494, 318, 2615, 524, 802, 2826, 3105, 2274, 845, 861, 1647, 1667, 1827, 3503, 3531, 3609, 292, 2589, 1622, 634, 2936, 459, 2759, 661, 2963, 3602, 925, 3559, 3542, 456, 620, 984, 1589, 2756, 2922, 3541, 3632, 532, 2834, 3202, 903, 309, 1369, 2606, 479, 1547, 1821, 2062, 2779, 3634, 3482, 3701, 856, 2489, 3158, 3720, 3640, 320, 2225, 2618, 842, 3195, 1775, 425, 841, 2724, 90, 1239, 1860, 1993, 2385, 901, 2259, 3124, 3125, 357, 2655, 2285, 3534, 374, 2976, 2283, 3151, 883, 3367, 3703, 1624, 1834, 1914, 2260, 877, 1544, 1872, 348, 2646, 800, 3103, 1566, 1567, 3678, 676, 2978, 929, 3470, 1873, 331, 336, 2629, 2634, 1882, 274, 1448, 2571, 1714, 96, 2391, 3427, 915, 3120, 3134, 3647, 3141, 3572, 159, 924, 2454, 3194, 3711, 1715, 3487, 3704, 3693, 189, 2484, 1620, 3479, 3575, 1867, 3658, 133, 2015, 2428, 50, 1535, 1638, 1771, 1849, 1960, 2345, 3605, 3616, 3712, 3165, 2289, 163, 2029, 2458, 37, 974, 1330, 1517, 1682, 1947, 2332, 3222, 3608, 3473, 56, 1502, 1964, 2351, 3661, 3171, 69, 413, 1416, 1736, 1975, 2364, 2712, 3581, 428, 1901, 2059, 2727, 3436, 33, 1246, 1470, 1536, 1705, 1764, 1887, 1943, 2328, 3468, 4, 1521, 1924, 2299, 3615, 1654, 1696, 774, 3077, 1687, 3571, 3672, 3699, 150, 996, 1310, 2023, 2445, 3413, 3451, 3625, 3627, 57, 985, 1237, 1577, 1631, 1668, 1731, 1777, 1965, 3198, 3447, 3494, 3518, 3544, 3641, 423, 2722, 3475, 253, 2079, 1550, 63, 945, 1187, 1414, 1734, 1783, 1850, 1970, 2358, 3368, 3592, 610, 1855, 2090, 2912, 868, 1514, 1659, 2284, 3536, 48, 1958, 2343, 3434, 2287, 3, 947, 1218, 1399, 1616, 1923, 2298, 3341, 52, 1410, 1961, 445, 2744, 3127, 224, 1380, 1774, 1847, 2042, 2521, 3611, 912, 25, 1772, 1806, 1903, 1938, 2320, 3493, 521, 2233, 2823, 3145, 156, 2025, 2451, 3439, 1525, 1878, 206, 609, 1288, 1842, 2089, 2503, 2911, 3604, 827, 502, 2803, 814, 2240, 3117, 3683, 554, 2072, 2856, 3184, 853, 3548, 1748, 3164, 3182, 3648, 576, 2077, 2878, 1555, 878, 338, 3454, 874, 908, 38, 390, 995, 1655, 1660, 1948, 2054, 2333, 2689, 3480, 3507, 152, 1023, 1573, 1902, 2024, 2447, 2, 1233, 1524, 1791, 1844, 1922, 2297, 29, 1173, 1941, 114, 1256, 1824, 2006, 3490, 16, 1215, 1651, 1670, 1703, 1726, 1815, 1932, 2311, 3203, 3455, 3495, 3582, 2288, 3467, 127, 2010, 2422, 20, 1048, 1259, 1727, 1822, 1832, 1934, 2315, 862, 847, 3695, 3471, 1706, 1785, 2324, 3583, 175, 1829, 2470, 1630, 3517, 960, 1177, 1474, 1541, 1626, 1691, 1723, 1781, 1787, 1891, 1918, 1925, 2301, 2407, 3402, 3519, 344, 2051, 2642, 817, 1357, 2242, 3551, 74, 1017, 1250, 1484, 1545, 1758, 1786, 1912, 1915, 1980, 2369, 3206, 575, 2076, 2877, 3537, 178, 219, 273, 1283, 1549, 1652, 1698, 1789, 2041, 2516, 2570, 3440, 53, 2348, 40, 926, 1406, 1551, 1708, 1864, 1950, 2335, 3387, 75, 643, 1281, 1710, 1981, 2370, 2945, 918, 1837, 129, 2011, 3359, 17, 1596, 2312, 549, 2851, 72, 1865, 1978, 2367, 1513, 21, 1348, 1845, 1935, 2316, 922, 1773, 1798, 1900, 2263, 3521, 46, 990, 1255, 1956, 2341, 1803, 200, 478, 2039, 2061, 2496, 2778, 3477, 3573, 1720, 3143, 24, 1816, 1937, 2319, 1642, 60, 1275, 1712, 1732, 1830, 1861, 1886, 1968, 2355, 3375, 3588, 3708, 110, 2004, 2405, 491, 930, 1349, 1825, 2064, 2792, 440, 2255, 2739, 1817, 3154, 3452, 347, 1034, 1197, 1429, 1581, 1653, 1674, 1680, 1695, 1741, 1780, 2082, 2534, 2645, 3417, 3437, 3553, 3606, 857, 2046, 2668, 3170, 3335, 3620, 921, 1675, 54, 1745, 1871, 1899, 1962, 2349, 541, 2068, 2843, 614, 987, 1613, 2092, 2916, 1656, 1700, 1836, 3135, 289, 2105, 2586, 824, 2248, 3652, 523, 2096, 2825, 574, 2075, 2876, 855, 3529, 15, 1384, 1570, 1756, 1931, 2310, 520, 2088, 2822, 3449, 2290, 1681, 1685, 830, 1666, 2251, 11, 1725, 1910, 1929, 2306, 3456, 1671, 2254, 2277, 388, 2053, 2687, 672, 2235, 2974, 1615, 1673, 1719, 2346, 412, 451, 1438, 1797, 1869, 1895, 1916, 2002, 2403, 2711, 3338, 1598, 281, 2097, 2578, 3635, 94, 1531, 1661, 1697, 1796, 1831, 2389, 879, 895, 3596, 1707, 513, 1749, 2080, 2815, 3118, 66, 1308, 1755, 1808, 1972, 2361, 3584, 1894, 1644, 567, 2103, 2869, 3633, 182, 2477, 1603, 3219, 461, 2106, 2761, 441, 2058, 2740, 109, 1802, 2003, 2404, 3147, 34, 1944, 2329, 260, 2081, 2557, 70, 1462, 1585, 1976, 2365, 3598, 3622, 3492, 130, 2012, 2425, 603, 1493, 1537, 1709, 2078, 2905, 1606, 888, 59, 1012, 1543, 1967, 2354, 3221, 3469, 3644, 3599, 89, 1302, 1992, 2384, 2258, 3123, 1518, 3600, 3653, 1605, 3448, 3524, 850, 3140, 3520, 3637, 483, 1752, 2063, 2784, 844, 1611, 161, 2028, 2456, 71, 1977, 2366, 3131, 3650, 920, 3474, 39, 1828, 1949, 2334, 3550, 1556, 508, 2070, 2809, 657, 2083, 2959, 3138, 890, 1874, 3130, 3657, 2347, 351, 1455, 2086, 2649, 1721, 141, 2017, 2436, 3680, 30, 195, 1351, 1897, 1942, 2325, 2491, 3539, 3662, 47, 1267, 1370, 1957, 2342, 2382, 3391, 23, 2318, 3172, 3173, 1625, 1907, 2036, 2492, 3428, 157, 2026, 2452, 1568, 145, 2020, 2440, 534, 2066, 2836, 279, 2093, 2576, 283, 1853, 2100, 2580, 3393, 831, 2252, 1759, 3696, 65, 1971, 2360, 621, 2107, 2923, 677, 2104, 2979, 3532, 840, 148, 1833, 2022, 2443, 3516, 14, 1843, 1889, 1930, 2309, 2435, 916, 397, 2084, 2696, 3586, 2271, 183, 1451, 1586, 1672, 1858, 2032, 2478, 205, 2044, 2502, 265, 2085, 2562, 560, 1469, 2087, 2862, 122, 2009, 2417, 3629, 373, 2050, 2672, 933, 1650, 1694, 420, 2055, 2719, 3458, 3159, 1662, 1692, 1740, 1809, 3528, 2473, 139, 2434, 84, 1263, 1490, 1906, 1989, 2379, 401, 439, 1597, 2057, 2738, 55, 1963, 2350, 3674, 2278, 3137, 104, 1999, 2399, 1819, 3523, 460, 1913, 2760, 95, 1810, 1996, 2390, 3580, 241, 2073, 2538, 100, 1997, 2395, 158, 875, 2027, 2453, 135, 2016, 2430, 132, 2014, 2427, 370, 1291, 1835, 2048, 2669, 3589, 197, 2037, 2493, 1768, 801, 1373, 1515, 1522, 1704, 1776, 2238, 3104, 3655, 3533, 539, 2841, 1665, 589, 2095, 2891, 111, 1527, 1846, 2005, 2406, 3656, 192, 2487, 3574, 1890, 1859, 938, 927, 190, 2035, 2485, 2286, 77, 826, 1753, 1983, 2372, 832, 2253, 376, 1851, 2052, 2675, 81, 98, 1320, 1986, 2376, 2393, 160, 215, 816, 2241, 2455, 2512, 42, 1363, 1908, 1952, 2337, 313, 2223, 2610, 1684, 2071, 2810, 278, 2220, 2575, 500, 1892, 1995, 2065, 2801, 3526, 3466, 3538, 3595, 181, 2476, 538, 1552, 1646, 2510, 2840, 3498, 3646, 3702, 1757, 1794, 1553, 3664, 92, 1335, 1812, 1994, 2387, 131, 2013, 2426, 893, 2275, 58, 1838, 1966, 2353, 3681, 79, 1984, 2374, 3563, 3594, 3676, 9, 1742, 1807, 1854, 1927, 2304, 173, 2031, 2468, 2262, 455, 2755, 80, 1823, 1985, 2375, 1881, 107, 1345, 1588, 1769, 1801, 2001, 2402, 3483, 1528, 3623, 869, 186, 1765, 2034, 2481, 314, 2101, 2611, 2047, 2636, 3362, 2272, 2293, 105, 1542, 2000, 2400, 115, 1627, 1718, 1739, 1805, 2007, 2410, 3515, 443, 2069, 2742, 2281, 3706, 3628, 85, 155, 170, 172, 1314, 1496, 1539, 1738, 1793, 1820, 1879, 1885, 1909, 1990, 2380, 2450, 2465, 2467, 3348, 3478, 3700, 83, 1441, 1870, 1988, 2378, 427, 2056, 2726, 1599, 188, 2483, 935, 2273, 860, 142, 2018, 2437, 185, 2033, 2480, 409, 2708, 904, 3443, 68, 1974, 2363, 3446, 3567, 1645, 3499) AND
  EXISTS (SELECT * FROM tblSpecimens WHERE DepoID = D.DepoID AND FileID = 1)

Import logs don't seem to record when a repo was not found so this would require a second pass over already imported data to add data attribute having the original SF DepoID (and perhaps other fields) as an initial step. We need to first check if any of the missing Depos already exist as reposotories in TW, we must avoid just uploading them all.

typophyllum commented 10 months ago

These repos are probably all missing in TW (I added the first one for new types in Massa 2023). So far I counted more than 1200 specimens without repo, almost 200 of them types, and there are probably many more.

mjy commented 10 months ago

@typophyllum You can do a lot of queries in filter CO to get numbers. This is "without repository": /tasks/collection_objects/filter?per=50&repository=false&extend[]=dwc_occurrence&extend[]=repository&extend[]=current_repository&extend[]=collecting_event&extend[]=taxon_determinations&extend[]=identifiers&exclude[]=object_labels&page=1.

So yes, there are many records without Repository linked, maybe 1/3 of the close to 100k.

Note, don't confuse "current repository" (a repository, typically not provided, indicating the current physical location of the specimen, with "repository" (the "home" repository, where it will ultimately reside).

Another note- "Current repository = with` may be broken.

LocoDelAssembly commented 10 months ago

These repos are probably all missing in TW (I added the first one for new types in Massa 2023). So far I counted more than 1200 specimens without repo, almost 200 of them types, and there are probably many more.

Technically are all missing, but maybe a "synonym" exist for some of them.

typophyllum commented 10 months ago

The original OSF has numerous specimen records without repository (unknown or georeferenced field observations linked to photos of live individuals). Would it be possible to match the repos in the old OSF (or all species files) against the repos used in TW, to see which ones are missing? Otherwise we would need to filter for collection objects that are types and don't have a repo (such already exist in the old OSF, but not that many). The lost repos with only normal specimens (e.g. INAT) will be more difficult to find.

LocoDelAssembly commented 10 months ago

Notes on how repositories were mapped by Marilyn (and others?):

/*
Notes about SF depos and mappings onto TW repos:  

        This task was done manually on SFdbMB database (finished 22 June 2017). Max(DepoID) was 3693. Max(DepoID) on SFdb on the same date 
        was 3725. Updates could be manually mapped using the max(DepoID) as a basis, though may not be a useful exercise. Any unmapped depo
        should have a DepoString associated with it which can be used as a data_attribute or note associated with the collection object, etc.
        This program to generate sfDepoStrings should be rerun when a new import with fresh data is processed.

        The mapping process started with a spreadsheet containing the TW repo_id, acronyms and name columns (Excel files TWrepos.xlsx,
        TWrepos.xls, and TWreposBACKUP.xlsx). A new column was added as SFDepoIDarray which contained comma separated SFDepoIDs corresponding 
        to the TW repo. The two columns, TWDepoID and SFDepoIDarray (longest record length = 120 characters), were eventually (through a very 
        ugly process) put in tabular form, sfTWDepoMappings.  This is the table that is made available to TW as tab delimited text file. The 
        depo data should be uploaded then generated into a hash consisting of the SFDepoID as key and the TWRepoID as value.

        Query to generate sfTWDepoMappings: SELECT * FROM [SFdbMB].[dbo].[sfTWDepoMappings] where TWDepoID > 0 order by TWDepoID

        Note: The following depos are now found in grbio.org website that were not in TW (and the corresponding SFDepoID):
                ABCL (2257), HN (3174), UBDH (3715)             
*/
typophyllum commented 10 months ago

So all depos added or modified after June 2017 couldn't migrate appropriately?

LocoDelAssembly commented 10 months ago

Added, I don't think modifications causes depos to be lost given they preserve their DepoID.

typophyllum commented 10 months ago

Could you perhaps extract from the old species files a list of DepoIDs, depository name, and species file number? So we could verify especially the ones with DepoID > 3693.

LocoDelAssembly commented 10 months ago

What information is missing in the spreadsheet linked here?

mabecabrera commented 9 months ago

In this case... OSF seems to have a duplicated repository.

image

Just informing here to keep track In TW it is present as

image
typophyllum commented 9 months ago

The 225 specimens in the Shanghai Entomology Museum, Chinese Academy of Sciences (SEM) seem to have been distributed into several different repos. Or what happened here?

imagen

imagen

LocoDelAssembly commented 9 months ago

I think them were mapped to IEAS

https://sfg.taxonworks.org/tasks/collection_objects/filter?per=50&repository_id=4764&extend%5B%5D=dwc_occurrence&extend%5B%5D=repository&extend%5B%5D=current_repository&extend%5B%5D=collecting_event&extend%5B%5D=taxon_determinations&extend%5B%5D=identifiers&exclude%5B%5D=object_labels&page=1

https://sfg.taxonworks.org/repositories/4764

I guess it is incorrect?

typophyllum commented 9 months ago

Yes, Institute of Entomology IEAS has Project use: 229. At least it isn't totally incorrect. It's probably identical to SIES. A new holotype I just assigned to SHIM. Very confusing.

typophyllum commented 9 months ago

Now the List of missing repositories has links to TW repos, except for a few obsolete and doubtful ones with very few specimens (the green ones I already fixed manually).

LocoDelAssembly commented 9 months ago

log.txt