ngs-fzb / MTBseq_source

MTBseq is an automated pipeline for mapping, variant calling, and detection of resistance-mediating and phylogenetic variants from Illumina whole-genome sequence data of Mycobacterium tuberculosis complex isolates.

Matrix file created by TBgroup #73

Closed ryobon-dev closed 2 years ago

ryobon-dev commented 2 years ago

Hi, I ran TBgroup on data from 397 strains. It worked well, but I found that the first row of the matrix file table is not displayed. Could you please check the attached matrix file?

NONE_joint_cf4_cr4_fr75_ph4_samples_amended_u95_phylo_w12.matrix.xlsx

aspitaleri commented 2 years ago

Hi, try this Python script on the matrix file:

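import sys

import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# The first column of the matrix file holds the sample names; reuse them as column headers.
header = pd.read_csv(sys.argv[1], sep='\t', index_col=False, names=['x'], header=None).iloc[:, 0]
names = list(header)
names.insert(0, '')

# Read the matrix again with the sample names as columns and fill the missing cells with 0.
df = pd.read_table(sys.argv[1], names=names, skiprows=0, index_col=0)
df.fillna(0, inplace=True)

# Derive the output names from the input file name (everything before the first dot).
file = sys.argv[1]
a = file.split('.')
out = a[0] + "_matrix.csv"
outx = a[0] + "_matrix.xls"
outxx = a[0] + "_matrix.xlsx"

# Tab-separated output, despite the .csv extension.
df.to_csv(out, sep='\t')

# The .xls output below can only hold up to 256 columns.
df.to_excel(outx)

# Write the .xlsx file through openpyxl.
wb = Workbook()
ws = wb.active
for r in dataframe_to_rows(df, index=True, header=True):
    ws.append(r)
for cell in ws['A'] + ws[1]:
    cell.style = 'Pandas'
wb.save(outxx)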

Save it as clean_matrix.py and then run:

python3 clean_matrix.py file.matrix

ryobon-dev commented 2 years ago

Thanks for your quick response.

Sorry, but it is extremely hard for me to run the Python script. Could you please provide an alternative approach?

Anyway, I tried to run your script, which I saved as matrix.py. Since I have never used a Python script, I can't understand the error message below.

(base) tomotada@tomotadanoMacBook-Pro Desktop % python matrix.py
  File "matrix.py", line 31
    ws.append(r)
    ^
IndentationError: expected an indented block

aspitaleri commented 2 years ago

Try the attached file. Rename it to clean_matrix.py and then run:

python clean_matrix.py file.matrix

clean_matrix.txt

ryobon-dev commented 2 years ago

Thank you very much for your support. I tried the file "clean_matrix.txt" but got the message below. I may need to change your script to fit my matrix file but I have no idea how to do it. For your information the file name of my matrix table is "NONE_joint_cf4_cr4_fr75_ph4_samples_amended_u95_phylo_w12.matrix" which is exactly the one obtained by TBgroup. Anyway, I installed pandas.

(base) tomotada@tomotadanoMacBook-Pro Desktop % python clean_matrix.txt

No xlsx will be generate. Only csv
Traceback (most recent call last):
  File "clean_matrix.txt", line 10, in <module>
    header = pd.read_csv(sys.argv[1], sep='\t',index_col=False, names=['x'],header=None).iloc[:,0]
IndexError: list index out of range
(base) tomotada@tomotadanoMacBook-Pro Desktop %

aspitaleri commented 2 years ago

You have to pass the NONE_joint_cf4_cr4_fr75_ph4_samples_amended_u95_phylo_w12.matrix file to the script as an argument:

python clean_matrix.txt NONE_joint_cf4_cr4_fr75_ph4_samples_amended_u95_phylo_w12.matrix
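For context, the "IndexError: list index out of range" above comes from sys.argv[1]: when the script is started without a file name, sys.argv holds only the script name, so index 1 does not exist. A minimal sketch of a guard that prints a usage hint instead; the helper name matrix_path_from_args is hypothetical and not part of the attached script:

```python
import sys


def matrix_path_from_args(argv):
    """Return the matrix path from an argument list, or exit with a usage hint.

    Hypothetical helper for illustration; argv has the same layout as sys.argv
    (script name first, then the arguments typed after it).
    """
    if len(argv) < 2:
        # No file name was given after the script name.
        sys.exit(f"usage: python {argv[0]} <file.matrix>")
    return argv[1]


# Simulated command line: python clean_matrix.py file.matrix
print(matrix_path_from_args(["clean_matrix.py", "file.matrix"]))
```

In a real script the call would be matrix_path_from_args(sys.argv).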

ryobon-dev commented 2 years ago

Thanks so much ! It works. I could generate the matrix file that I wanted to get.

Thank you very much again.

cutpatel commented 2 years ago

We only write half of the matrix because the other half contains the same data. A relatively easy approach is to open the .matrix file in Excel, insert a row at the top, and then copy the first column and paste it into that row with the transpose option.
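For anyone who prefers a script to Excel, the same idea can be sketched with the Python standard library only. This is an illustration and not part of MTBseq. It assumes the .matrix file is tab-separated, that each row starts with the sample name followed by integer distances to the samples listed before it, and that any empty fields can be ignored. The function names are hypothetical.

```python
import csv
import io


def read_half_matrix(handle):
    """Read rows of the form: sample name, then distances to earlier samples."""
    names, rows = [], []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or not fields[0].strip():
            continue  # skip blank lines
        names.append(fields[0])
        rows.append(fields[1:])
    return names, rows


def to_full_matrix(names, rows):
    """Mirror the half matrix so that every cell of the square table is filled."""
    n = len(names)
    full = [[0] * n for _ in range(n)]
    for i, values in enumerate(rows):
        for j, value in enumerate(values):
            if value.strip():  # ignore empty (padding) fields
                full[i][j] = full[j][i] = int(value)
    return full


def write_with_header(names, full, handle):
    """Write the table with the sample names as the first row and first column."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([""] + names)
    for name, row in zip(names, full):
        writer.writerow([name] + row)


# Tiny made-up example with three samples.
demo = "s1\ns2\t3\ns3\t5\t2\n"
names, rows = read_half_matrix(io.StringIO(demo))
full = to_full_matrix(names, rows)
out = io.StringIO()
write_with_header(names, full, out)
print(out.getvalue())
```

To use it on a real file, the io.StringIO objects would be replaced with open(...) handles for the input and output paths.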