Closed j-hagedorn closed 8 months ago
@sdaranyi, can you please provide examples where both the motif number and the label are present in the id field? That would help me to find the issue.
In TMI, starting at rows 1613-1614, the content changes and empty rows are inserted as spacing. This continues until rows 10278-10279.
The same happens in rows 26415, 26425, 27036, 33157, and 45630-45632.
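A check like the one described above could be scripted rather than done by eye. The following is only a minimal sketch, assuming the TMI table is exported as CSV with the id in the first column and that a bare motif number looks like `A1` or `B11.2.1`; the `scan_rows` helper, the `MOTIF_ID` pattern, and the sample rows are hypothetical and not taken from the repository.

```python
import csv
import io
import re

# Assumed shape of a bare motif number such as "A1" or "B11.2.1".
# This pattern is a guess for illustration, not the project's own rule.
MOTIF_ID = re.compile(r"^[A-Z]\d+(\.\d+)*\.?$")

def scan_rows(lines, id_column=0):
    """Return (blank_rows, mixed_id_rows) as lists of 1-based row numbers."""
    blank_rows, mixed_id_rows = [], []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if not any(cell.strip() for cell in row):
            # Row is empty or contains only empty cells.
            blank_rows.append(row_number)
        elif not MOTIF_ID.match(row[id_column].strip()):
            # The id cell holds something other than a bare motif number,
            # for example a number followed by its label.
            mixed_id_rows.append(row_number)
    return blank_rows, mixed_id_rows

# Made-up sample data: row 2 is blank, row 3 has a label in the id cell.
sample = io.StringIO(
    "A1,,Creator\n"
    ",,\n"
    "A1.1 Sun-god as creator,,\n"
    "B11.2.1,,Dragon\n"
)
blank, mixed = scan_rows(sample)
print(blank, mixed)
```

A header row, if the file has one, would be reported as a mixed id row by this sketch and would need to be skipped first.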
On Mon, 8 Jan 2024 at 09:13, Joshh @.***> wrote:
@sdaranyi noted that "Looking at today's pulled latest TMI, there's something wrong with the structure. Column 1 has mostly motif numbers only with labels in Column 3, but in a number of rows both the motif number and the label are present. Worse, based on this table you said there are 46222 individual motifs, whereas it has 54906 lines. Which one is the correct number? Chances are that this could change the proportion of motifs used in the ATU."
@sdaranyi, my tmi dataset still shows 46,222 unique values in the id field, and 46,230 rows in the dataset, with the following items being duplicated (and needing to be resolved):
image.png (view on web) https://github.com/j-hagedorn/trilogy/assets/7065685/54639ee8-f301-48de-85e4-c6f3c12da2e5
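The comparison of row count against unique id count can be reproduced with a short script. This is a minimal sketch using made-up ids, not the actual tmi data; the `summarize_ids` helper is hypothetical. The difference between rows and unique values (8 in the figures above, 46,230 minus 46,222) equals the number of surplus copies that need resolving.

```python
from collections import Counter

def summarize_ids(ids):
    """Return (row_count, unique_count, duplicated) for a list of id values."""
    counts = Counter(ids)
    # Keep only the ids that occur more than once, with their counts.
    duplicated = {value: n for value, n in counts.items() if n > 1}
    return len(ids), len(counts), duplicated

# Made-up ids for illustration only.
ids = ["A1", "A1.1", "A1.1", "B2", "B2", "B2"]
rows, unique, duplicated = summarize_ids(ids)

# Surplus rows are the extra copies beyond the first of each duplicated id.
surplus = sum(n - 1 for n in duplicated.values())
print(rows, unique, duplicated, surplus)
```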
@sdaranyi and @salmonix, please identify any additional data cleaning issues you may find related to the tmi dataset in the comments below, and I will try to resolve those quickly, if they are minor, rather than opening new issues to track them.
Thanks, @sdaranyi
@sdaranyi this issue should be fixed too. I tested it in Excel and it works fine.